Preface


Nonparametric statistics is concerned with the treatment
of standard statistical problems when the familiar assumption of nor- 
mality is replaced by general assumptions concerning the distribution 
form. One of the oldest nonparametric methods is Karl Pearson's
$\chi^2$-test of fit proposed in 1900 (Reference [1], Chapter 3). Another is
the classical sign test. Development of nonparametric methods was 
slow until the second war years, but since then their growth has 
touched almost every phase of statistical activity. This book is an 
attempt to collect and unify these diverse developments. Preliminary 
to this, the first two chapters provide a survey of the general tech- 
niques of estimation and hypothesis testing. 

The book is intended as a second course in mathematical statistics. 
Prerequisites are a knowledge of calculus and familiarity with an 
introduction to statistics such as is found in Hoel, Introduction to
Mathematical Statistics. A knowledge of measure theory is not neces- 
sary, the essential ideas of measure being introduced with a statistical 
interpretation. The first two chapters are used at the University of 
Toronto for an undergraduate course surveying recent small sample 
methods, while the remainder of the book covers material for a graduate
course on the applications of these methods in the nonparametric
branch of statistics. 

Many of the names of contributors to the development of nonpara- 
metric theory appear in the references and bibliography at the end of 
each chapter. In particular, Professor Erich Lehmann has con- 
tributed much by collecting the results of his own and others’ re- 
searches in his two sets of mimeographed notes on estimation and 
hypothesis testing. 

Completion of this book was made possible by the generous support 
of the United States Office of Naval Research during the spring of
1955. At that time I enjoyed a visiting appointment at Princeton
University, and I wish to express my appreciation to Professor Wilks 
and Professor Tukey for their encouragement. 
D. A. S. FRASER 
University of Toronto 
October 1956

CHAPTER 1


Probability Concepts 


1. INTRODUCTION 


In Chapter 1 we introduce the notion of probability and discuss some 
concepts that are naturally associated with probability. In Chapter 2 we 
add the necessary mathematical structure for statistical decisions and 
treat in some detail the methods of estimation, hypothesis testing, con- 
fidence regions, and tolerance regions. In these two chapters the purpose 
is to have a treatment of the more important ideas of general statistical 
inference and therefore to enable the development of the remaining 
chapters with the emphasis on the methods particular to the field of 
nonparametric theory. 


2. MEASURABLE SPACE 


In constructing a probability model for an experiment, the first step is a 
consideration of what are the possible outcomes of the experiment. We 
assume that all possibilities for the outcome can be foreseen, and we refer 
to this aggregation of outcomes as the sample space and designate it by a 
capital script letter, say $\mathcal{X}$. An arbitrary outcome or point of this space is
designated by $x$, of the space $\mathcal{Y}$ by $y$. As an example consider the tossing
of a single coin. The outcome could perhaps be considered in some detail,
but usually interest is restricted to a description of what face shows after
the toss. Accordingly, if we designate the two outcomes in the obvious
manner, we have $\mathcal{X} = \{H, T\}$; or if we envisage the possibility of the
coin standing on edge, then we have $\mathcal{X} = \{H, T, E\}$. As another example
consider a sequence of five measurements of the gravitational constant.
An outcome then is a sequence or ordered set of five numbers,
$x = (x_1, \cdots, x_5)$, and the sample space could be the Euclidean space of
five dimensions, $\mathcal{X} = R^5$. Of course, not all points in $R^5$ are possible
outcomes, but what is essential is that $\mathcal{X}$ contain all possible outcomes.

The probability model of an experiment is based on a certain phenomenon
observed, not in a single performance of the experiment, but in a long
sequence of repetitions under as nearly identical conditions as possible.
The phenomenon is concerned with the frequency with which the outcome
falls in any particular subset of the space $\mathcal{X}$. It would be satisfying if the
later development of the mathematical model would permit us to consider
the phenomenon, and its analog in the model, for arbitrary subsets $A$ of $\mathcal{X}$.
However, in general this is not possible. Accordingly, we associate with
the space $\mathcal{X}$ a class $\mathcal{A}$ of subsets $A$ of $\mathcal{X}$, these subsets to be referred to as
measurable sets. We shall impose some natural restrictions on the class
$\mathcal{A}$, but first we introduce some notation.

If $x$ is a point in (or element of) $A$, we write $x \in A$; and if $x$ is not in $A$,
we write $x \notin A$. If each point in a set $A$ is also in a set $A'$, we say $A$ is
contained in $A'$ and write $A \subset A'$. The set consisting of each point either in
$A_1$ or in $A_2$ (or in both) is called $A_1$ union $A_2$ and is designated by $A_1 \cup A_2$.
$\bigcup_{i=1}^{n} A_i$ is the set consisting of all points belonging to any of the sets
$A_1, \cdots, A_n$. $A_1 \cap A_2$ is the set of points that belong to both $A_1$ and $A_2$,
and is called the intersection of $A_1$ and $A_2$. Similarly $\bigcap_{i=1}^{n} A_i$ is the set of
points belonging to each of $A_1, A_2, \cdots, A_n$. $A_1 - A_2$ is the set of points
belonging to $A_1$ but not to $A_2$ and is called the difference. A particular
difference is $\mathcal{X} - A$ and is called the complement of $A$. $\{x \mid \text{condition}\}$ is
the set of points satisfying the condition following the vertical bar; for
example, we can write $A_1 - A_2 = \{x \mid x \in A_1,\ x \notin A_2\}$.

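The restrictions imposed on the class $\mathcal{A}$ are such that, if we apply the
simple operations of union, intersection, and complementation to sets in
$\mathcal{A}$, then the resulting sets will also be in $\mathcal{A}$. We require that $\mathcal{A}$ be a
σ-algebra of subsets of $\mathcal{X}$, that is, $\mathcal{A}$ be nonempty and satisfy:

(i) If $A_1, A_2, \cdots \in \mathcal{A}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$.
(ii) If $A \in \mathcal{A}$, then $\mathcal{X} - A \in \mathcal{A}$.

It is easy to show that these conditions imply that $\mathcal{X} \in \mathcal{A}$, and that, if
$A_1, A_2, \cdots \in \mathcal{A}$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{A}$. See Problems 1 and 2.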
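For a finite sample space the two conditions can be checked mechanically. The following is a minimal sketch, not part of the original text: the helper names `power_set` and `is_sigma_algebra` are hypothetical, and the check of pairwise unions is sufficient only because a finite class admits no genuinely infinite unions.

```python
from itertools import combinations

def power_set(space):
    """All subsets of a finite space, each as a frozenset."""
    points = list(space)
    return {frozenset(c) for r in range(len(points) + 1)
            for c in combinations(points, r)}

def is_sigma_algebra(space, sets):
    """Check nonemptiness, closure under union (i) and under complement (ii).

    Assumption: `sets` is a finite class, so pairwise unions suffice.
    """
    space = frozenset(space)
    sets = set(sets)
    if not sets:
        return False
    closed_union = all((a | b) in sets for a in sets for b in sets)
    closed_complement = all((space - a) in sets for a in sets)
    return closed_union and closed_complement

coin = {"H", "T", "E"}                       # heads, tails, edge
all_subsets = power_set(coin)                # the usual choice for a finite space
trivial = {frozenset(), frozenset(coin)}     # smallest possible sigma-algebra
not_closed = {frozenset(), frozenset({"H"}), frozenset(coin)}  # lacks {T, E}

print(is_sigma_algebra(coin, all_subsets))
print(is_sigma_algebra(coin, trivial))
print(is_sigma_algebra(coin, not_closed))
```

The third class fails because the complement of `{H}` is missing, which illustrates why condition (ii) is needed in addition to (i).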

The combination of a space $\mathcal{X}$ and a σ-algebra $\mathcal{A}$ of subsets of $\mathcal{X}$ is
called a measurable space and is designated by $\mathcal{X}(\mathcal{A})$. The term measurable
is used merely to indicate that this is the structure upon which a
measure or probability measure can be defined. (We shall consider
measures in the next section.) In many cases the classes will not be
mentioned explicitly. For example, if $\mathcal{X}$ consists of a finite number of
points, then $\mathcal{A}$ will always be the σ-algebra consisting of all subsets of $\mathcal{X}$.
Also, if $\mathcal{X} = R^n$, the class $\mathcal{A}$ will almost invariably be the class of Borel sets
or the class of Lebesgue measurable sets. We define the Borel sets when
$n = 1$; an analogous definition applies when $n > 1$. The class of Borel
sets is the smallest σ-algebra containing the intervals $[a, b] = \{x \mid a \le x \le b\}$
for all $a, b$. Problem 3 shows that it is meaningful to speak of a smallest
σ-algebra. Lebesgue measurable sets will be defined in the next section.
In the sequel some familiarity with Lebesgue measure and integration
would be helpful, and for further reading see Monroe [1] and Halmos [2].

Frequently a number of experiments are considered simultaneously, in
which case the over-all outcome is the sequence of outcomes from the
individual experiments. Therefore, we define a product space
$\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$; it consists of all $n$-tuples $(x_1, \cdots, x_n)$ where $x_1 \in \mathcal{X}_1$,
$x_2 \in \mathcal{X}_2, \cdots, x_n \in \mathcal{X}_n$. If each $\mathcal{X}_i$ is a measurable space $\mathcal{X}_i(\mathcal{A}_i)$, we
now consider whether there is a natural σ-algebra $\mathcal{A}$ to be associated with
the product space. If $A_i \in \mathcal{A}_i$ for $i = 1, \cdots, n$, then we might well be
interested in whether the over-all outcome $(x_1, \cdots, x_n)$ was such that each
coordinate $x_i$ was in the corresponding set $A_i$: that is, whether
$(x_1, \cdots, x_n) \in A_1 \times \cdots \times A_n$. Accordingly, we require that $\mathcal{A}$ contain
all the product or rectangular sets,

$A_1 \times \cdots \times A_n = \{(x_1, \cdots, x_n) \mid x_i \in A_i \ (i = 1, \cdots, n)\}$,

and, in fact, we define the natural σ-algebra $\mathcal{A}$ on the product space to be
the smallest σ-algebra containing all the product sets $A_1 \times \cdots \times A_n$ and
we designate it by $\mathcal{A} = (\mathcal{A}_1, \cdots, \mathcal{A}_n)$. See Problems 4 and 5.

In analyzing an experiment the statistician frequently does not need to 
consider the outcome as such with all the detailed information it contains, 
but may be content with a condensation or function of it. We now 
consider formally the procedure of condensing the outcome of an experiment.
Let $\mathcal{X}(\mathcal{A})$ and $\mathcal{T}(\mathcal{B})$ be two measurable spaces corresponding,
respectively, to the outcome and its condensation. Then for the statistician
to obtain a condensation he needs a function, say $t(x)$, which maps $\mathcal{X}$ into
$\mathcal{T}$. Thus, for each $x \in \mathcal{X}$, $t(x)$ is a unique point in $\mathcal{T}$.

For the space $\mathcal{X}(\mathcal{A})$ we confined our attention to the sets belonging to
$\mathcal{A}$. Correspondingly we have a natural restriction to impose on the
function $t(x)$. For, if the statistician investigates the frequency with which
the condensation $t(x)$ falls in a set $B \in \mathcal{B}$, he could just as well have
observed the frequency with which the original outcome falls in a set
$A$ defined by

(2.1) $A = t^{-1}(B) = \{x \mid t(x) \in B\}$;

$A$ is the inverse image of $B$ under the mapping $t(x)$ and consists of all
points that are mapped into $B$ by $t(x)$. Since we have confined the
statistician's attention to measurable sets, then, if he considers a set
$B \in \mathcal{B}$, certainly the inverse image set $A = t^{-1}(B)$ should belong to $\mathcal{A}$.
This is our requirement on $t(x)$, and a function satisfying this requirement
we call a measurable function or statistic.

A function $t(x)$ from $\mathcal{X}(\mathcal{A})$ to $\mathcal{T}(\mathcal{B})$ is a statistic if, for every $B \in \mathcal{B}$,
$t^{-1}(B) \in \mathcal{A}$.
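The measurability requirement can be illustrated on a finite space. This is a sketch under stated assumptions: the two-coin sample space, the statistic "number of heads", and the helper names `inverse_image` and `is_statistic` are all chosen here for illustration and do not come from the text.

```python
from itertools import combinations

def power_set(points):
    """All subsets of a finite set, each as a frozenset."""
    points = list(points)
    return {frozenset(c) for r in range(len(points) + 1)
            for c in combinations(points, r)}

def inverse_image(t, space, B):
    """t^{-1}(B) = {x | t(x) in B}."""
    return frozenset(x for x in space if t(x) in B)

def is_statistic(t, space, A_class, B_class):
    """True if every B in B_class has its inverse image in A_class."""
    return all(inverse_image(t, space, B) in A_class for B in B_class)

space = {"HH", "HT", "TH", "TT"}
number_of_heads = lambda x: x.count("H")
B_class = power_set({0, 1, 2})

# Two candidate sigma-algebras on the sample space.
fine = power_set(space)
first_coin_only = {frozenset(), frozenset(space),
                   frozenset({"HH", "HT"}), frozenset({"TH", "TT"})}

print(is_statistic(number_of_heads, space, fine, B_class))             # measurable
print(is_statistic(number_of_heads, space, first_coin_only, B_class))  # not measurable
```

With the coarser class, which records only the first coin, the set of outcomes with no heads is not measurable, so the number of heads is not a statistic relative to that class.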


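Sometimes we shall have a function $t(x)$ from a measurable space $\mathcal{X}(\mathcal{A})$
into a space $\mathcal{T}$ and would like to define measurable sets on $\mathcal{T}$ so that $t(x)$
is measurable, that is, is a statistic. First we make the trivial restriction that $\mathcal{T}$
consist only of points obtained under the mapping $t(x)$; i.e., $\mathcal{T} = t(\mathcal{X})$.
We define a class $\mathcal{B}^*$ to consist of all sets $B \subset \mathcal{T}$ whose inverse images
$t^{-1}(B)$ are elements of $\mathcal{A}$:

(2.2) $\mathcal{B}^* = \{B \mid t^{-1}(B) \in \mathcal{A}\}$.

As soon as we have verified that $\mathcal{B}^*$ is a σ-algebra on $\mathcal{T}$, it is obvious that
$t(x)$ is a statistic from $\mathcal{X}(\mathcal{A})$ to $\mathcal{T}(\mathcal{B}^*)$. The proof that $\mathcal{B}^*$ is a σ-algebra
(Problem 7) is easily obtained from the following relations:

(2.3) $t^{-1}\left(\bigcup_{\alpha} B_\alpha\right) = \bigcup_{\alpha} t^{-1}(B_\alpha)$,
(2.4) $t^{-1}\left(\bigcap_{\alpha} B_\alpha\right) = \bigcap_{\alpha} t^{-1}(B_\alpha)$,

which are given for proof in Problem 6. This σ-algebra $\mathcal{B}^*$ is, in fact, the
largest σ-algebra on $\mathcal{T}$ for which the function $t(x)$ is measurable.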

In condensing the outcome of an experiment, the important thing to a 
statistician is a knowledge of which $x$'s will produce a given value of the
function and not the particular designation or name for that value. Thus, 
for $n$ real numbers, $x_1, \cdots, x_n$, the sample mean $\bar{x}$ is just as useful as $1/\bar{x}$
or $\sum x_i$; and conversely. The value of each can be obtained from the value
of any other. Thus from some points of view the essential idea of a
statistic for the statistician is given by a partition of the space $\mathcal{X}$ into
disjoint sets $T(x)$:

$\mathcal{X} = \bigcup_{x} T(x)$,

where $T(x)$ is the set containing $x$, and for any two points $x$ and $x'$ either

$T(x) = T(x')$

or

$T(x) \cap T(x') = \emptyset$,

where $\emptyset$ is the empty set. The statistician obtains his condensation by
recording the set $T(x)$ which contained the outcome $x$ and forgetting which
$x$ in the set $T(x)$ was the actual outcome. The space $\mathcal{T}$ for the mapping
$T(x)$ has as elements subsets of the space $\mathcal{X}$. A set $B$ will belong to the
σ-algebra $\mathcal{B}^*$ induced on $\mathcal{T}$ if, when its elements are considered as subsets
of $\mathcal{X}$ and their union is taken, the result is a subset belonging to $\mathcal{A}$.
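The point that a statistic matters only through the partition it induces can be sketched as follows. The small sample space of pairs and the helper `partition_by` are assumptions made for this illustration; the sketch compares the partitions produced by the sum and by the mean of two observations.

```python
from itertools import product
from collections import defaultdict

def partition_by(t, space):
    """Group the points of a finite space into the cells T(x) = {x' | t(x') = t(x)}."""
    cells = defaultdict(set)
    for x in space:
        cells[t(x)].add(x)
    return {frozenset(cell) for cell in cells.values()}

# Hypothetical sample space: two observations, each taking the value 0, 1, or 2.
space = list(product([0, 1, 2], repeat=2))

by_sum = partition_by(lambda x: sum(x), space)
by_mean = partition_by(lambda x: sum(x) / len(x), space)

# The labels differ (sum versus mean) but the cells are the same sets.
print(by_sum == by_mean)
print(sorted(sorted(cell) for cell in by_sum))
```

The cells are disjoint and their union is the whole space, which is the partition property stated above.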


3. MEASURES AND PROBABILITY MEASURES 


In Section 2 we described the aggregation of outcomes of an experiment 
by the sample space $\mathcal{X}$, and the class of subsets of $\mathcal{X}$ to which we restrict
attention by the σ-algebra $\mathcal{A}$. In this section we introduce the idea of
probability. 

If in a series of repetitions of an experiment we observe a frequency 
ratio, the proportion of times that the outcome $x$ falls in a set $A \subset \mathcal{X}$,
then it is an empirical result that in many types of experiments this ratio 
becomes quite stable as the number of repetitions is increased. Experi- 
ments possessing this property are called random experiments. The 
stability of the frequency ratio naturally leads us to hypothesize the 
existence of a number or probability to be associated with that set A. This 
number is the value about which the frequency ratio seems to settle in an 
extended series of repetitions. 

Thus, to each set $A \in \mathcal{A}$, we can associate a number or probability.
We require that these probabilities obey some simple rules corresponding
to rules obeyed by the frequency ratio. We have, then, an example of a
measure: For each set of $\mathcal{A}$ there is a number which measures the
“probability” that an outcome falls in that set, a number which represents
the frequency ratio for that set in a long series of repetitions of the
experiment.

Before talking in detail about probabilities we introduce the general
idea of a measure.
$\mu(A)$ is a measure over $\mathcal{X}(\mathcal{A})$ if

(i) For each $A \in \mathcal{A}$, $\mu(A)$ is a real number.
(ii) For each $A \in \mathcal{A}$, $\mu(A) \ge 0$.
(iii) For $A_1, A_2, \cdots$, which are disjoint† and belong to $\mathcal{A}$,

$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$.

† Sets are disjoint if there are no points that belong to more than one set.

Because $\mathcal{A}$ is a σ-algebra, $\mathcal{X} \in \mathcal{A}$. Then, from conditions (i), (ii), it
follows that $\mu(\mathcal{X})$ exists and is a non-negative real number; also from (iii)
it follows that

$\mu(A) + \mu(\mathcal{X} - A) = \mu(\mathcal{X})$,

where each term is non-negative. Conditions (i), (ii), (iii) thus imply that
the values of $\mu(A)$ are bounded by $\mu(\mathcal{X})$. More precisely then, we should
call a function satisfying the conditions above a bounded measure. We
obtain the definition of an unbounded measure by relaxing the first
condition (i) above to

(i)' For each $A \in \mathcal{A}$, $\mu(A)$ is a real number or $+\infty$.


As an example of an unbounded measure, consider Lebesgue measure
defined over the Borel sets on the real line $R^1$. If we associate with any
interval $[a, b]$ its length $b - a$, we have then a measure of that interval.
But it is obvious that the intervals do not form a σ-algebra. The smallest
σ-algebra containing the intervals is the class of Borel sets defined in the
previous section. Now it can be shown that the definition of a measure
over a class of sets together with the conditions (i)', (ii), (iii) uniquely defines
a measure for all sets in the smallest σ-algebra containing this class, and of
course this measure agrees with the given measure for sets in the original
class. This is called extending a measure, and for the details the reader is
referred to Cramér [3], p. 19 and Halmos [2], p. 54. Lebesgue measure
then is the extension to the Borel sets of the simple measure of an interval
as given by its length.‡

For $\mu(A)$ to be a probability measure we add the one additional condition:

(iv) $\mu(\mathcal{X}) = 1$.

However, when a measure is a probability measure, we shall usually
designate it by $P(A)$ rather than $\mu(A)$. It is to be carefully noted that each
of the four conditions or axioms for a probability measure corresponds to a 
similar condition satisfied by the frequency ratio. 

As a simple example, consider the tossing of two unbiased coins. 
We can designate the possible outcomes by HH, HT, TH, TT, where for 
example HT stands for heads for the first coin and tails for the second. 


$\mathcal{X} = \{HH, HT, TH, TT\}$.

‡ If we add to the Borel sets any set contained in a Borel set having Lebesgue measure
zero and then take the smallest σ-algebra, we obtain the Lebesgue measurable sets.
The Lebesgue measure extends uniquely to this larger σ-algebra. This is an example of
the completion of a σ-algebra with respect to a measure.

The probability measure that seems to fit experimental results best is given
by 

P(HH) = 1/4, P(HT) = 1/4, 

P(TH) = 1/4, P(TT) = 1/4; 


the measure of a set containing more than one point is obtained by applying 
condition (iii) to the values in the above equations. 
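As a minimal sketch of how condition (iii) extends the four point probabilities to every subset, the function `P` below (a name chosen here for illustration) simply adds the probabilities of the points in the set. Exact fractions are used so that the axioms can be checked without rounding.

```python
from fractions import Fraction

# Point probabilities for the two unbiased coins, as given in the text.
point_prob = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
              "TH": Fraction(1, 4), "TT": Fraction(1, 4)}
space = set(point_prob)

def P(A):
    """Probability of a set A, obtained by additivity over its points."""
    return sum((point_prob[x] for x in A), Fraction(0))

at_least_one_head = {"HH", "HT", "TH"}

print(P(at_least_one_head))                              # 3/4
print(P(space))                                          # 1, condition (iv)
print(P(at_least_one_head) + P(space - at_least_one_head))  # 1
```

The last line is the relation $\mu(A) + \mu(\mathcal{X} - A) = \mu(\mathcal{X})$ specialized to a probability measure.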

If $\mathcal{X}$ is a Euclidean space $R^n$, it is frequently more convenient to
describe the probability measure by means of a distribution function
$F(x_1, \cdots, x_n)$ defined by

(3.1) $F(x_1, \cdots, x_n) = P\{(x_1', \cdots, x_n') \mid x_i' \le x_i \ (i = 1, \cdots, n)\}$
      $= P\{\,]-\infty, x_1] \times \cdots \times ]-\infty, x_n]\,\}$

where $]a, b]$ is the interval from $a$ to $b$, open on the left and closed on the
right. Now it is quite easy to show that $F(x_1, \cdots, x_n)$ satisfies the three
conditions:

(i)* $F(x_1, \cdots, x_{i-1}, -\infty, x_{i+1}, \cdots, x_n) = 0$.
(ii)* $F(+\infty, \cdots, +\infty) = 1$.
(iii)* $\Delta_{x_1}(a_1, b_1) \cdots \Delta_{x_n}(a_n, b_n)\, F(x_1, \cdots, x_n) \ge 0$
       for $a_i < b_i$ $(i = 1, \cdots, n)$.

The operator $\Delta_{x_i}(a, b)$ is defined by

$\Delta_{x_i}(a, b)\, F(x_1, \cdots, x_n) = F(x_1, \cdots, x_{i-1}, b, x_{i+1}, \cdots, x_n)$
      $- F(x_1, \cdots, x_{i-1}, a, x_{i+1}, \cdots, x_n)$.

It is quite easy to show that the expression occurring in condition (iii)* is
the probability

$P\{\,]a_1, b_1] \times \cdots \times ]a_n, b_n]\,\}$.

Conversely, a distribution function $F(x_1, \cdots, x_n)$ satisfying (i)*, (ii)*,
(iii)* uniquely determines a probability measure $P(A)$ defined for the Borel
sets of $R^n$. The proof of this provides another example of the extension
of a measure. We note that $F(x_1, \cdots, x_n)$ by virtue of (iii)* provides a
non-negative measure for every rectangular set of the form $]a_1, b_1] \times
\cdots \times ]a_n, b_n]$. Then, just as the Borel sets are obtained from these
rectangular sets, so also is the probability measure for the Borel sets.
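Applying the difference operator in every coordinate amounts to an alternating sum of $F$ over the corners of the rectangle. The sketch below assumes, purely for illustration, the distribution function of two independent uniform variables on the unit interval; the names `rectangle_probability` and `F_uniform` are hypothetical.

```python
from itertools import product

def rectangle_probability(F, a, b):
    """P{ ]a_1, b_1] x ... x ]a_n, b_n] } from a distribution function F.

    Each corner uses b_i or a_i in coordinate i; a corner gets sign
    (-1)^(number of a_i used), which is what the repeated differences produce.
    """
    n = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=n):
        corner = [b[i] if c else a[i] for i, c in enumerate(choice)]
        sign = (-1) ** (n - sum(choice))
        total += sign * F(*corner)
    return total

def clip01(u):
    """Distribution function of a uniform variable on [0, 1]."""
    return min(1.0, max(0.0, u))

def F_uniform(x1, x2):
    """Assumed example: two independent uniform coordinates."""
    return clip01(x1) * clip01(x2)

p = rectangle_probability(F_uniform, (0.2, 0.1), (0.5, 0.6))
print(p)  # close to 0.3 * 0.5 = 0.15
```

For this example the result agrees with the area of the rectangle, and any rectangle with $a_i < b_i$ gives a non-negative value, as condition (iii)* requires.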

Often we shall want to construct a measure on a product space
$\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ from measures $\mu_i$ on the component spaces $\mathcal{X}_i(\mathcal{A}_i)$.
In general, this can be done in many ways, but we consider now a natural
way corresponding in statistics to independence between the experiments
recorded in the component spaces $\mathcal{X}_i$.

The product measure $\mu(A)$ on $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ obtained from the
measures $\mu_1(A_1), \cdots, \mu_n(A_n)$, respectively, on $\mathcal{X}_1, \cdots, \mathcal{X}_n$ is given by

(3.2) $\mu(A_1 \times \cdots \times A_n) = \prod_{i=1}^{n} \mu_i(A_i)$

for all $A_1, \cdots, A_n$ belonging, respectively, to $\mathcal{A}_1, \cdots, \mathcal{A}_n$.

This definition for product sets together with the conditions for a
probability measure uniquely determines the measure $\mu(A)$ for all sets
$A \in (\mathcal{A}_1, \cdots, \mathcal{A}_n)$, the natural σ-algebra on the product space. This is
yet another example of the extension of a measure.

If the measures $\mu_i$ are probability measures, then quite obviously $\mu$ is
also a probability measure. For probability measures the above definition
has the following meaning in terms of frequency ratios. If there are $n$
experiments with the outcomes recorded, respectively, in the $n$ spaces
$\mathcal{X}_1, \cdots, \mathcal{X}_n$ and if there is no dependence or connection between the
experiments, then the frequency ratio for the set $A_1 \times \cdots \times A_n$ is found
in a long sequence of repetitions of the combined experiment to
approximate the product of the frequency ratios of the individual sets $A_i$.
This phenomenon we refer to as statistical independence and hypothesize
that the probabilities give an exact equality; this is the definition above.
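For finite component spaces the product measure can be written down directly. The following sketch assumes a fair coin and a fair die as the two experiments (an example not taken from the text) and checks the defining product relation on one rectangular set.

```python
from fractions import Fraction
from itertools import product

# Assumed component experiments: a fair coin and a fair die.
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
die = {k: Fraction(1, 6) for k in range(1, 7)}

def measure(point_mass, A):
    """Measure of a set A obtained by adding the masses of its points."""
    return sum((point_mass[x] for x in A), Fraction(0))

# Product measure: the mass of a pair is the product of the component masses.
product_mass = {(c, d): coin[c] * die[d] for c, d in product(coin, die)}

A1 = {"H"}            # heads on the coin
A2 = {2, 4, 6}        # an even face on the die
rectangle = set(product(A1, A2))

print(measure(product_mass, rectangle))        # 1/4
print(measure(coin, A1) * measure(die, A2))    # 1/4
print(measure(product_mass, product_mass))     # total mass 1
```

Because both components are probability measures, the total mass of the product is one, in line with the remark above.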

Let $t(x)$ be a statistic from $\mathcal{X}(\mathcal{A})$ into $\mathcal{T}(\mathcal{B})$. If $t(x)$ is used to
condense the information in the outcome $x$ for an experiment over $\mathcal{X}$, and
if $P(A)$ is the probability measure descriptive of the outcomes in a sequence
of repetitions, it is natural to inquire what probability measure governs
the values of $t(x)$ in a sequence of repetitions. In analogy with an
equivalent relation for the frequency ratio we define a probability measure $Q(B)$
over $\mathcal{T}(\mathcal{B})$ induced by the probability measure $P(A)$ over $\mathcal{X}(\mathcal{A})$:

(3.3) $Q(B) = P(t^{-1}(B))$;

that is, the probability measure for $t(x)$ in $B$ is the probability measure of
the set of all outcomes mapped into $B$ by $t(x)$. Of course, for the
definition to be meaningful it must be shown that $Q(B)$ is really a probability
measure and satisfies the four conditions; this is given as Problem 14.

The induced probability measure Q(B) of the statistic t(x) is sometimes 
called the marginal probability measure. The reason is that in some 
simple examples the outcomes can be arranged in a table with rows and 
columns and a given value of a statistic might correspond to all the out- 
comes in a row. Then the probability for that value would be obtained 


by adding the probabilities in the row and perhaps placing the total 
alongside “in the margin.” 
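The induced measure can be computed for a finite example by adding, for each value of the statistic, the probabilities of the outcomes mapped to it. The sketch below takes the two unbiased coins with the statistic "number of heads"; the function name `induced_measure` is chosen here for illustration.

```python
from fractions import Fraction

P_point = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
           "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

def t(x):
    """Statistic: the number of heads in the outcome."""
    return x.count("H")

def induced_measure(P_point, t):
    """Q({v}) = P(t^{-1}({v})) for each value v taken by the statistic."""
    Q = {}
    for x, p in P_point.items():
        Q[t(x)] = Q.get(t(x), Fraction(0)) + p
    return Q

Q = induced_measure(P_point, t)
print(Q[0], Q[1], Q[2])   # 1/4 1/2 1/4
```

The value for one head is the sum of the probabilities of HT and TH, which is the "adding along a row" described in the paragraph above.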




The term random variable is quite inessential to the mathematical model, 
but, on occasion, it can lead to considerable convenience of expression. 
We therefore define: 


$X$, a random variable, is a symbol for a measurable space $\mathcal{X}(\mathcal{A})$ and
probability measure $P(A)$ over $\mathcal{X}(\mathcal{A})$.


The symbol X can be associated with the result of a future performance of 
the experiment, the future outcome. However, the only way to give any 
meaning in the model to such a statement is by means of the probability 
measure, and for this the “random variable” is superfluous. The
convenience is that we can use the expression $t(X)$ to stand for the random
variable of the induced distribution. Then we have a symbolic analog of
$X$ an outcome from the experiment on $\mathcal{X}$ and $t(X)$ an outcome for the
condensation by the statistic $t(x)$. If we wish a single symbol for the
random variable $t(X)$, we shall capitalize the letter giving the statistic
and write

$T = t(X)$.


It is also convenient to speak of the probability that a random variable 
fulfills a condition and to write Pr {X condition}. By this we mean the 
probability measure of the set of points fulfilling the condition, and, for 
example, we would have 


$\Pr\{X \in A,\ X \in A'\} = P(A \cap A')$.


The term probability distribution is also frequently used in statistics. 
It is a general term for a “distribution of probability” and does not refer 
explicitly to the measure that defines the “distribution” or, if the space is
Euclidean, to the related distribution function.


4. EXPECTATION AND CONDITIONAL PROBABILITIES 


To introduce expectation and conditional probability we need to define 
integration with respect to a measure u(A). One of the standard defini- 
tions for the Lebesgue integral can be extended in a straightforward 
manner. Also many of the properties of the Lebesgue integral carry over 
to the more general form. The derivation of some of these properties is 
given as problems in Section 8. 

Let $f(x)$ be a real-valued statistic over $\mathcal{X}(\mathcal{A})$, and assume for the
moment that it is bounded, $B < f(x) < C$. Also, let $\mu(A)$ be a bounded
measure over $\mathcal{X}(\mathcal{A})$. Then we define the integral of $f(x)$ with respect to
$\mu(A)$ by

(4.1) $\int_{\mathcal{X}} f(x)\, d\mu(x) = \lim_{n \to \infty} \sum_{i=1}^{n} M_i\, \mu\{x \mid M_{i-1} < f(x) \le M_i\}$,

where $B = M_0 < M_1 < \cdots < M_n = C$, $M_i - M_{i-1} < \epsilon$ for each $i$, and
$\epsilon$ is taken to zero as $n \to \infty$. It is straightforward to prove that the limit
exists; see Problem 15. The integral can be viewed approximately as a
sum of values of $f(x)$ weighted with the measure of the points giving those
values to the function; and, if $\mu(A)$ is a probability measure, as a weighted
average of values of the function.
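The approximating sums in (4.1) can be tried on a small example. This sketch assumes a bounded measure given by point masses on three points and the function $f(x) = x^2$, neither of which comes from the text, and uses an equally spaced subdivision $M_i = B + i(C - B)/n$ as one admissible choice.

```python
def level_set_sum(f, mass, lower, upper, n):
    """sum_i M_i * mu{x | M_{i-1} < f(x) <= M_i} for n equal steps on ]lower, upper]."""
    step = (upper - lower) / n
    total = 0.0
    for i in range(1, n + 1):
        m_prev = lower + (i - 1) * step
        m_i = lower + i * step
        # Measure of the level set: add the masses of the points falling in it.
        mu = sum(w for x, w in mass.items() if m_prev < f(x) <= m_i)
        total += m_i * mu
    return total

# Assumed example: point masses on {0, 1, 2} and f(x) = x^2, with -1 < f < 5.
mass = {0: 0.2, 1: 0.5, 2: 0.3}
f = lambda x: x ** 2

direct = sum(f(x) * w for x, w in mass.items())   # weighted sum of values, 1.7
n = 6000
approx = level_set_sum(f, mass, -1.0, 5.0, n)

print(direct, approx)
```

Each point is counted at the upper end $M_i$ of its level interval, so the approximation exceeds the weighted sum by less than the step length times the total mass, and the excess shrinks as $n$ grows.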

The definition of the integral can be extended to unbounded functions.
Let $f(x)$ be an unbounded function and $f_{BC}(x)$ be $f(x)$ altered to satisfy the
lower and upper bounds $B$, $C$:

$f_{BC}(x) = C$     if $C < f(x)$
       $= f(x)$  if $B \le f(x) \le C$
       $= B$     if $f(x) < B$.

Then if $\int_{\mathcal{X}} f_{BC}(x)\, d\mu(x)$ approaches a finite limit as $B \to -\infty$ and $C$
separately $\to +\infty$, then the integral of $f(x)$ is said to exist and

$\int_{\mathcal{X}} f(x)\, d\mu(x) = \lim_{B \to -\infty,\ C \to +\infty} \int_{\mathcal{X}} f_{BC}(x)\, d\mu(x)$.


The integral over a measurable subset $A$ of $\mathcal{X}$ can be obtained directly
from the definition of the integral over $\mathcal{X}$. For this we define the
characteristic function $\phi_A(x)$ of the set $A$:

(4.2) $\phi_A(x) = 1$ if $x \in A$
             $= 0$ if $x \notin A$.

Obviously $\phi_A(x)$ is measurable. From the result in Problem 16, $\phi_A(x) f(x)$
is also measurable. Then we define the integral of $f(x)$ over the measurable
set $A$ by

(4.3) $\int_A f(x)\, d\mu(x) = \int_{\mathcal{X}} f(x)\, \phi_A(x)\, d\mu(x)$.

It is sometimes possible to give a definition of the integral with respect to
an unbounded measure $\mu(A)$. Suppose we can find a monotone sequence
of sets $A_1 \subset A_2 \subset A_3 \cdots$ such that $\mathcal{X} = \bigcup_{i=1}^{\infty} A_i$ and $\mu(A_i) < \infty$ for each $i$.
Then within each set $A_i$ the measure $\mu(A)$ is bounded, and the integral
over $A_i$ can be defined. If the limit,

$\lim_{i \to \infty} \int_{A_i} f(x)\, d\mu(x)$,

exists and is independent of the sequence $A_1, A_2, \cdots$, then that limit is
defined to be the value of the integral

$\int_{\mathcal{X}} f(x)\, d\mu(x)$.


If $X$ is a random variable for $P(A)$ over $\mathcal{X}(\mathcal{A})$ and $f(x)$ is a real-valued
statistic, then we define the expectation of $f(X)$ to be

(4.4) $E\{f(X)\} = \int_{\mathcal{X}} f(x)\, dP(x)$

if the expression exists; otherwise we say the expectation does not exist.
If probabilities are visualized as frequency ratios for an extended sequence
of repetitions of an experiment, this expectation then appears as the average
of the values of $f(x)$ for that sequence. Kolmogorov's theorem to be
mentioned later adds theoretical weight to this experimental interpretation.
For, if $f(X_1), f(X_2), \cdots$ is the sequence obtained by repeating the
experiment, then with probability one the average $n^{-1} \sum_{j=1}^{n} f(X_j)$ converges to
$E\{f(X)\}$ as $n \to \infty$.

The random variable f (X) in formula (4.4) is a real-valued random 
variable, and hence has a distribution on the real line, $R^1$. If we designate
the random variable $f(X)$ by $Y$ and its measure by $Q(B)$, then we could
consider the simple function $y$ and the corresponding expectation,

(4.5) $E\{Y\} = \int_{R^1} y\, dQ(y)$.


Since f (X) and Y are the same random variable, we should hope that (4.4) 
and (4.5) are equal. That they are equal can be obtained directly from 
the definitions of the two integrals; see Problem 23. 

The idea of conditioned probability has its origin in a simple experi- 
mental situation. If an experimenter knows that an outcome fulfills some 
condition, then what probability measure represents the outcome? The 
interpretation in terms of the frequency ratio is of course possible only 
when there is a positive probability that the condition will be fulfilled. 
We first define conditional probability in this simpler situation. 

Let $C$ be a subset of $\mathcal{X}$, and suppose that $x \in C$ is the condition the
experimenter knows the outcome fulfills. If $P(C) > 0$, then we define the
conditional probability of $A$, given $C$:

(4.6) $P(A \mid C) = \dfrac{P(A \cap C)}{P(C)}$.


It is easily seen that P(A|C) is a probability measure as a function of A. 
The frequency interpretation of probabilities requires that the left-hand 
side be the ratio of the times that the outcome is in both $A$ and $C$ to the total
times that it is in $C$. The right-hand side expressed in terms of frequency
ratios clearly reduces to this. In the notation of random variables we
could write

(4.7) $P(A \cap C) = P(C)\, P(A \mid C)$

in the form

$\Pr\{X \in A,\ X \in C\} = \Pr\{X \in C\}\, \Pr\{X \in A \mid X \in C\}$.


Thus for the conditional probability P(A |C) we can speak of the probability 
that the random variable X falls in A, given that it is in C. 
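A short numerical illustration of (4.6) and (4.7), using the two unbiased coins from Section 3; the events chosen here (both heads, given at least one head) are an assumed example.

```python
from fractions import Fraction

point_prob = {x: Fraction(1, 4) for x in ("HH", "HT", "TH", "TT")}
space = set(point_prob)

def P(A):
    """Probability of a set, by additivity over its points."""
    return sum((point_prob[x] for x in A), Fraction(0))

def conditional(A, C):
    """P(A | C) = P(A and C) / P(C), defined only when P(C) > 0."""
    if P(C) == 0:
        raise ValueError("conditional probability is not defined when P(C) = 0")
    return P(A & C) / P(C)

A = {"HH"}                    # both coins show heads
C = {"HH", "HT", "TH"}        # at least one head

print(conditional(A, C))                       # 1/3
print(P(A & C) == P(C) * conditional(A, C))    # relation (4.7)
print(conditional(space, C))                   # 1: a probability measure in A
```

Among the three equally likely outcomes in $C$ exactly one lies in $A$, which is the frequency interpretation of the ratio.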

We do not attempt to define conditional probability given C, when 
P(C)= 0. In fact, any definition whatsoever could be used and would 
not violate the distribution of probability as given by the measure P(A). 
However, if we have a statistic $t(x)$ and if $C = \{x \mid t(x) = t\}$, then a definition
is possible in relation to the statistic $t(x)$. Here again there is a degree of
arbitrariness in that the conditional probability may remain undefined or
arbitrarily defined for a set of values of $t(x) = t$ having probability zero.

To obtain the more general definition of conditional probability we need 
some results from measure theory. We define absolute continuity for 
measures: 


$\nu(A)$ is absolutely continuous with respect to $\mu(A)$ and written
$\nu(A) \ll \mu(A)$ if, whenever $\mu(A) = 0$, also $\nu(A) = 0$.


Consider a few simple examples of absolute continuity; the proofs are 
given as problems in Section 8. First, the binomial distribution has a 
probability measure absolutely continuous with respect to the Poisson but 
not with respect to the normal. Second, the normal distribution measure 
is absolutely continuous with respect to Lebesgue measure. Third, all the 
distributions on $R^1$ given by a simple probability density function,

$P(A) = \int_A f(x)\, dx$,

are absolutely continuous with respect to Lebesgue measure.

The Radon-Nikodym theorem is a powerful theorem concerned with the
property of absolute continuity. We state the theorem here but do not
attempt a proof. For the proof and further reading, see Halmos [2].

THEOREM 4.1. (THE RADON-NIKODYM THEOREM). A necessary and
sufficient condition that a measure $\nu(A)$ be absolutely continuous with
respect to a measure $\mu(A)$ is that there exist a non-negative function
$g(x)$ such that

(4.8) $\nu(A) = \int_A g(x)\, d\mu(x)$

for every $A \in \mathcal{A}$. $g(x)$ is determined uniquely except on a set having $\mu$
measure zero.

The last sentence in the theorem has the following interpretation. If there
are two functions $g_1(x)$ and $g_2(x)$ satisfying (4.8), then those functions will
be different at most on a set having $\mu$ measure zero.
Consider an example of the above theorem. Let $P(A)$ be the measure of
the Poisson distribution with parameter $m$. Then for a non-negative
integer $x$,

$P(x) = \dfrac{e^{-m} m^{x}}{x!}$.

Let $N(A)$ be the number of non-negative integers in the set $A$; it is easily
seen that $N(A)$ is a measure. Now, if the set $A$ has $N(A) = 0$, then there
are no non-negative integers in it, and therefore the Poisson measure
$P(A) = 0$. By the Radon-Nikodym theorem there must exist a function
$g(x)$ satisfying

$P(A) = \int_A g(x)\, dN(x)$,

and by taking $A$ to be a typical non-negative integer it is seen that

$P(A) = \int_A \dfrac{e^{-m} m^{x}}{x!}\, dN(x)$.
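Integration with respect to the counting measure $N$ is a sum over the non-negative integers in the set, so the Poisson example can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the parameter value $m = 2$ is an arbitrary choice, and sets are restricted to finite collections of numbers.

```python
import math

def poisson_density(x, m):
    """g(x) = e^{-m} m^x / x!, the density with respect to counting measure."""
    return math.exp(-m) * m ** x / math.factorial(x)

def counting_measure(A):
    """N(A): the number of non-negative integers in the finite set A."""
    return sum(1 for x in A if isinstance(x, int) and x >= 0)

def poisson_measure(A, m):
    """P(A) as the integral of g over A with respect to N, that is, a sum."""
    return sum(poisson_density(x, m) for x in A if isinstance(x, int) and x >= 0)

m = 2.0
no_integers = {0.5, -3, 7.25}       # N(A) = 0, so P(A) must be 0
first_sixty = set(range(60))        # carries essentially all the probability

print(counting_measure(no_integers), poisson_measure(no_integers, m))
print(poisson_measure({0}, m))      # e^{-2}
print(poisson_measure(first_sixty, m))
```

The first line shows absolute continuity in action: a set containing no non-negative integers has both counting measure and Poisson measure equal to zero.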


We now extend the definition of conditional probability. Let $t(x)$ be a
statistic from $\mathcal{X}(\mathcal{A})$ to $\mathcal{T}(\mathcal{B})$. See Fig. 1. Also let $P_X(A)$ be a measure
over $\mathcal{X}$, and let $P_T(B)$ be the measure over $\mathcal{T}$ induced by the statistic
$t(x)$. As an example of conditional probability by our definition above,
we have

$P_X\{A^0 \cap t^{-1}(B)\} = P_X\{A^0 \mid t^{-1}(B)\}\, P_X\{t^{-1}(B)\}$,

or equivalently

(4.9) $P_X\{A^0 \cap t^{-1}(B)\} = P_X\{A^0 \mid t^{-1}(B)\}\, P_T\{B\}$.

In words the equation says that the probability that $X \in A^0$ and $t(X) \in B$ is
equal to the probability that $t(X) \in B$ times the conditional probability that
$X \in A^0$, given $t(X) \in B$. In attempting to obtain $P_X\{A^0 \mid t^{-1}(B)\}$ when $B$
reduces to a point, we might look for a function of $t$, $P_X\{A^0 \mid t(x) = t\}$,
which gives the right answers in the sense that it satisfies the relation

(4.10) $P_X\{A^0 \cap t^{-1}(B)\} = \int_B P_X\{A^0 \mid t(x) = t\}\, dP_T(t)$

for all $B \in \mathcal{B}$. The Radon-Nikodym theorem will give us the function
$P_X\{A^0 \mid t(x) = t\}$.

[Figure 1. The probability mapping for conditional probability.]

We obtain a new measure on $\mathcal{X}$, say $\nu(A)$, by deleting all the probability
that is not contained in $A^0$:


HA) = | dla) dP x) 
= PAN AY, 
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where ¢.,(z) is a characteristic function defined in (4.2). Now by the 
Radon-Nikodym theorem, » << Px; in fact, this relation is obvious 
since actually »(4) < Px(A) for all A. The measures on 7 induced 
from »(A) and P(A) are, respectively, 

vr(B) = v(t-(B)), 

Pr(B) = Px(r(B)). 
Now these induced measures also satisfy the same relation of absolute 
continuity, »p << Pp. This is easily seen since (4) < P(A) implies 
¥_(B) < P(B), and hence vp << Pr. 

Now, applying the Radon-Nikodym theorem to vp << Py, we have 

the existence of a function of t, say P(A°| 1), such that 


(4.11) ¥q(B) = j Parl t) dP,(t). 


Since $\nu_T$ depends on $A^0$, in general the integrand furnished by the
Radon-Nikodym theorem will also depend on $A^0$, and we have indicated this.
Noting that

    $\nu_T(B) = \nu(t^{-1}(B)) = P_X\{A^0 \cap t^{-1}(B)\},$

we can rewrite (4.11) as

(4.12)    $P_X\{A^0 \cap t^{-1}(B)\} = \int_B P(A^0 \mid t)\, dP_T(t),$

where $P(A^0 \mid t)$ is uniquely determined except at most on a set having $P_T$
measure zero. This is the equation (4.10) we set out to obtain.
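The defining relation (4.10) can be checked by hand in a finite setting, where the Radon-Nikodym derivative reduces to the elementary ratio of probabilities. The following is a minimal sketch under assumed, purely illustrative choices (a uniform measure on six points, the statistic `t(x) = x mod 3`, and the set `A0 = {1, 2, 3}`); none of these come from the text.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical finite model: X uniform on {1, ..., 6}, statistic t(x) = x mod 3.
P_X = {x: Fraction(1, 6) for x in range(1, 7)}
A0 = {1, 2, 3}  # the fixed set A^0


def t(x):
    return x % 3


# Induced measure P_T on the range of the statistic.
P_T = {}
for x, p in P_X.items():
    P_T[t(x)] = P_T.get(t(x), 0) + p


def cond_prob(tv):
    """Candidate determination of P(A^0 | t): the elementary ratio."""
    joint = sum(p for x, p in P_X.items() if x in A0 and t(x) == tv)
    return joint / P_T[tv]


def lhs(B):
    """P_X{A^0 intersected with t^{-1}(B)}."""
    return sum(p for x, p in P_X.items() if x in A0 and t(x) in B)


def rhs(B):
    """Integral over B of P(A^0 | t) dP_T(t), a finite sum here."""
    return sum(cond_prob(tv) * P_T[tv] for tv in B)


def all_subsets(values):
    values = list(values)
    return chain.from_iterable(
        combinations(values, r) for r in range(len(values) + 1)
    )


# Relation (4.10) must hold for every set B of values of t.
relation_holds = all(lhs(set(B)) == rhs(set(B)) for B in all_subsets(P_T))
print(relation_holds)
```

Exact rational arithmetic is used so that the two sides are compared without rounding error.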

In exactly the same manner it is possible to define the conditional
expectation of a real function $h(x)$ with respect to a statistic $t(x)$ and a
random variable $X$ over $\mathcal{X}$. If $h(x) \ge 0$, the Radon-Nikodym theorem
gives the existence of $E\{h(X) \mid t\}$ to satisfy

(4.13)    $\int_{t^{-1}(B)} h(x)\, dP_X(x) = \int_B E\{h(X) \mid t\}\, dP_T(t)$

for all $B \in \mathcal{B}$. $E\{h(X) \mid t\}$ is uniquely determined except on a set of values
of $t$ having $P_T$ measure zero. If $h(x)$ is not necessarily positive, it can be
written

    $h(x) = h^+(x) - h^-(x),$

where $h^+(x) \ge 0$, $h^-(x) \ge 0$, and

    $h^+(x) = h(x)$ if $h(x) \ge 0$, and $= 0$ otherwise,
    $h^-(x) = -h(x)$ if $h(x) < 0$, and $= 0$ otherwise.

Then (4.13) can be shown to hold in general by applying it to the two
components of $h(x)$. If $B$ is taken equal to $\mathcal{T}$, (4.13) can be written

    $E_X\{h(X)\} = E_T\{E\{h(X) \mid T\}\}.$
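The identity obtained by taking $B = \mathcal{T}$ in (4.13) can be illustrated on a finite space. This is a sketch only; the uniform measure, the statistic `t(x) = x mod 2`, and the function `h(x) = x**2` are assumptions chosen for the example.

```python
from fractions import Fraction

# Hypothetical finite model: X uniform on {1, ..., 6}.
P_X = {x: Fraction(1, 6) for x in range(1, 7)}


def t(x):
    return x % 2  # the statistic


def h(x):
    return x * x  # the real function whose expectation is taken


# Induced measure of the statistic.
P_T = {}
for x, p in P_X.items():
    P_T[t(x)] = P_T.get(t(x), 0) + p


def cond_exp(tv):
    """E{h(X) | t} for a point t of positive probability."""
    return sum(h(x) * p for x, p in P_X.items() if t(x) == tv) / P_T[tv]


E_h = sum(h(x) * p for x, p in P_X.items())              # E_X{h(X)}
E_of_cond = sum(cond_exp(tv) * P_T[tv] for tv in P_T)    # E_T{E{h(X)|T}}
print(E_h, E_of_cond)
```

Both quantities are computed from the same finite measure, so their agreement is exactly the statement of (4.13) with $B$ equal to the whole range of the statistic.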

It is perhaps natural to hope that $P(A \mid t)$ as a function of $A$ be a measure
on $\mathcal{X}(\mathcal{A})$. However, in general this is not true. Doob [4] has shown that,
if $\mathcal{X}$ is Euclidean, then it is possible to determine $P(A \mid t)$ so that it is a
probability measure for all $t$ except perhaps for $t$ values having probability
zero. It is straightforward to prove that $P(A \mid t)$ has some properties of a
measure:

(i) $P(\mathcal{X} \mid t) = 1$, except perhaps for $t$ in a set having $P_T$ measure zero.

(ii) $0 \le P(A \mid t) \le 1$, except perhaps for $t$ in a set having $P_T$ measure
zero.

(iii) If $A_1, A_2, \cdots$ is a sequence of disjoint sets, then
$\sum_{i=1}^{\infty} P(A_i \mid t) = P\left(\bigcup_{i=1}^{\infty} A_i \mid t\right)$, except perhaps for $t$ in a set having $P_T$ measure zero.

The difficulty is that we would want these conditions for all $A, A_1, A_2, \cdots$
to hold simultaneously, except for $t$ values in a set of measure zero.

Conditional expectation can also be defined directly from a determination
of conditional probability. Consider the expectation of $h(x)$ relative
to the statistic $t(x)$ and the probability measure $P_X(A)$ over $\mathcal{X}$. Let

(4.14)    $s_\lambda(t) = \sum_{k=-\infty}^{\infty} \lambda k\, P_X\{\{x \mid \lambda k \le h(x) < \lambda(k+1)\} \mid t\}.$

If $E\{h(X)\}$ exists, it can be shown the sum converges, that $s_\lambda(t)$ approaches
a limit as $\lambda \to 0$, and that this limit is a determination of $E\{h(X) \mid t\}$ as
given by the defining equation (4.13). For the details, see Kolmogorov [5].

We shall frequently need to qualify a condition with a phrase such as
"$\cdots$ except perhaps for $t$ in a set having $P_T$ measure zero." For convenience
we abbreviate this to "$\cdots$ almost everywhere ($P_T$)" and even
further abbreviate to "$\cdots$ a.e. ($P_T$)."


5. SUFFICIENT STATISTICS 


We have been considering only a single probability measure over a space
$\mathcal{X}(\mathcal{A})$. However, in applications we shall seldom know the probability
measure applicable to an experiment, but we may be able to say that it is
one of a class of probability measures. In fact, we can always do this by
choosing a sufficiently large class.

Let $\{P_\theta(A) \mid \theta \in \Omega\}$ be a class of probability measures over $\mathcal{X}(\mathcal{A})$. $\theta$ is
called the parameter of the class, and it indexes the probability measures.
The range of $\theta$ is $\Omega$ and is called the parameter space. In our statistical
model the class $\{P_\theta \mid \theta \in \Omega\}$ will consist of all the probability measures
which the statistician on the basis of previously obtained information
considers to be possible representations of the random experiment he is
investigating.

In general, the statistician's purpose is to make some sort of statement or
decision about the probability measure applicable to his experiment: that
is, to make a decision about the value of the parameter $\theta$. After choosing
the class $\{P_\theta \mid \theta \in \Omega\}$, the only information he has to guide his decision
about $\theta$ is the result of the experiment, the random variable $X$. However,
in many cases the outcome $X$ is a complicated set of numbers, and, as
mentioned in Section 1 when statistic was introduced, a simplification is
desirable. If at all feasible he should choose for his condensation a
statistic which loses as little as possible of the information contained in the
outcome and relevant to the parameter $\theta$. It is this desire which prompts
the definition of sufficient statistic.

R. A. Fisher introduced the concept of a sufficient statistic as one
"containing all the relevant information" in an experiment. To illustrate
this idea, consider the class of probability measures over $R^n$ obtained from
$(X_1, \cdots, X_n)$ where the $X_i$ are independent and each is normally distributed
with mean $\xi$ and variance 1, $\xi$ taking any value from $-\infty$ to $+\infty$. Here
$\theta = \xi$ and $\Omega = R^1$. It can be easily shown, and will be later in this
section, that the conditional distribution of $(x_1 - \bar{x}, x_2 - \bar{x}, \cdots, x_n - \bar{x})$,
given the sample mean $\bar{x}$, is independent of the parameter $\xi$. Consequently,
to examine the values of $x_1 - \bar{x}, \cdots, x_n - \bar{x}$, after examining
the value of $\bar{x}$, is equivalent to taking an outcome from a given fixed
distribution having nothing to do with $\xi$. It is in this sense then that we say
the statistic $y = \bar{x}$ contains "all the relevant information."

Fisher gave a more formal definition of a sufficient statistic $y = t(x)$:
For any other statistic $t'(x)$, the conditional distribution of $t'(x)$, given $t(x)$,
is independent of $\theta$. Halmos and Savage [6] give a definition of sufficient
statistic to fit the framework of our probability model. In our notation,
it is:

A statistic $y = t(x)$ is a sufficient statistic for the family of probability
measures $\{P_\theta \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$ if there exists a function $P(A \mid t)$ such
that

(5.1)    $P_\theta(A \cap t^{-1}(B)) = \int_B P(A \mid t)\, dP_\theta^T(t)$

for all $A \in \mathcal{A}$, $B \in \mathcal{B}$: that is, if there exists a determination of the
conditional distribution, given $t(x)$, which is independent of $\theta$. $P_\theta^T$ is the
measure for $t(x)$ induced from the measure $P_\theta$ over $\mathcal{X}$.
A sufficient statistic is important in problems both of estimation and of
hypothesis testing. In the next chapter we shall have a general result and
results particular to the two fields, estimation and hypothesis testing.
However, the description of Fisher’s approach should indicate that we have 
justification for ignoring the outcome in an experiment if a sufficient 
statistic is available. Of course the outcome itself forms a sufficient 
statistic, but we shall usually be interested in a statistic that makes a 
reduction in the problem. 

We give now two examples of sufficient statistics. 


EXAMPLE 5.1. Consider the class of probability measures over $R^n$
mentioned earlier in this section. $X = (X_1, \cdots, X_n)$, where the $X_i$ are
independent, each has the normal distribution with mean $\xi$ and variance 1,
and $\xi \in\, ]-\infty, +\infty[$. We wish to show that $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$ is a sufficient
statistic or equivalently that $t(x) = n^{1/2}\bar{x}$ is sufficient.

For convenience we introduce a new random variable $Y = (Y_1, \cdots, Y_n)$
by means of the transformation† $y' = Mx'$ where $M$ is an orthogonal
$n \times n$ matrix. Giving $M$ the first row $(n^{-1/2}, \cdots, n^{-1/2})$, we find that
$y_1 = n^{1/2}\bar{x} = t(x)$. The $X$ distribution over $R^n$ is

(5.2)    $P_\xi^X(A) = c^n \int_A \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - \xi)^2\right] \prod_{i=1}^{n} dx_i,$

where $c = (2\pi)^{-1/2}$. The transformation from $x$ to $y$ has Jacobian 1, and

    $\sum_{i=1}^{n} (x_i - \xi)^2 = (y_1 - n^{1/2}\xi)^2 + \sum_{i=2}^{n} y_i^2;$

therefore

(5.3)    $P_\xi^Y(A^*) = c^n \int_{A^*} \exp\left[-\frac{1}{2}(y_1 - n^{1/2}\xi)^2 - \frac{1}{2} \sum_{i=2}^{n} y_i^2\right] \prod_{i=1}^{n} dy_i.$

To exhibit $y_1 = t(x)$ as a sufficient statistic, we need to show that the
probability measure $P_\xi^X$ can be used in formula (5.1). We have the
relation

    $P_\xi^X(A \cap t^{-1}(B)) = P_\xi^Y(MA \cap Mt^{-1}(B)),$

where $MA$ for example is the set obtained by applying the transformation
$M$ to the points in $A$. $t^{-1}(B)$ is the set of points having $t(x)$ in $B$; therefore,
since $y_1 = t(x)$, $Mt^{-1}(B)$ is the set of points having the first coordinate $y_1$ in
$B$, and we can write

    $Mt^{-1}(B) = B \times R^{n-1}.$

† $y = (y_1, \cdots, y_n)$ is a row vector and the matrix transpose $y'$ is a column vector. The
transformation then is given by the ordinary matrix multiplication $Mx'$.


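Then we have

(5.4)    $P_\xi^X(A \cap t^{-1}(B)) = P_\xi^Y(MA \cap (B \times R^{n-1}))$

    $= c \int_B \left[ c^{n-1} \int_{R^{n-1}} \phi_{MA}(y_1, \cdots, y_n) \exp\left(-\frac{1}{2} \sum_{i=2}^{n} y_i^2\right) \prod_{i=2}^{n} dy_i \right] \exp\left[-\frac{1}{2}(y_1 - n^{1/2}\xi)^2\right] dy_1$

    $= c \int_B P(MA \mid y_1) \exp\left[-\frac{1}{2}(y_1 - n^{1/2}\xi)^2\right] dy_1.$

$P(MA \mid y_1)$, defined by the bracket in the previous expression, is the
probability measure of $MA$ for fixed $y_1$. It is seen that formula (5.4) is in
the form (5.1) and hence that $t(x) = n^{1/2}\bar{x}$ is a sufficient statistic.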
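The key algebraic step of Example 5.1, the decomposition of $\sum (x_i - \xi)^2$ under an orthogonal transformation whose first row is $(n^{-1/2}, \cdots, n^{-1/2})$, can be checked numerically. The sketch below assumes one particular choice of $M$, the Helmert matrix; the text only requires that $M$ be orthogonal with the stated first row, and the sample values are arbitrary.

```python
import math


def helmert(n):
    """An orthogonal n x n matrix with first row (n^-1/2, ..., n^-1/2).

    The Helmert construction is one convenient choice (an assumption here).
    """
    rows = [[1.0 / math.sqrt(n)] * n]
    for k in range(1, n):
        scale = 1.0 / math.sqrt(k * (k + 1))
        rows.append([scale] * k + [-k * scale] + [0.0] * (n - k - 1))
    return rows


def transform(M, x):
    """Compute y' = M x' by ordinary matrix multiplication."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


x = [0.3, -1.2, 2.5, 0.7]   # arbitrary outcome
xi = 0.4                    # arbitrary value of the mean
n = len(x)
y = transform(helmert(n), x)

lhs = sum((v - xi) ** 2 for v in x)
rhs = (y[0] - math.sqrt(n) * xi) ** 2 + sum(v ** 2 for v in y[1:])
print(lhs, rhs)             # equal up to rounding
print(y[0], math.sqrt(n) * sum(x) / n)  # y_1 equals n^(1/2) * xbar
```

Only the first coordinate $y_1$ carries the dependence on $\xi$, which is what allows the bracketed factor in (5.4) to be free of the parameter.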


EXAMPLE 5.2. Let $X_1, X_2$ be independent, and let $X_i = 1, 0$ with
probability, respectively, $p$, $q = 1 - p$. Consider the probability measures
of $(X_1, X_2)$ over $R^2$ for $p \in\, ]0, 1[$. By using the first and simpler definition
of conditional probability it is quite easy to show that $t(x) = x_1 + x_2$ is a
sufficient statistic.
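The claim of Example 5.2 can be verified by direct enumeration: the conditional distribution of $(x_1, x_2)$ given $x_1 + x_2$ should not involve $p$. The following is a small sketch; the function name `conditional_given_sum` and the two values of $p$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def conditional_given_sum(p):
    """Conditional probabilities of (x1, x2) given t = x1 + x2.

    Uses the elementary definition P(A | B) = P(A and B) / P(B).
    """
    q = 1 - p
    joint = {
        (x1, x2): (p if x1 else q) * (p if x2 else q)
        for x1, x2 in product((0, 1), repeat=2)
    }
    # Induced distribution of the statistic t = x1 + x2.
    p_t = {}
    for (x1, x2), pr in joint.items():
        p_t[x1 + x2] = p_t.get(x1 + x2, 0) + pr
    return {x: pr / p_t[sum(x)] for x, pr in joint.items()}


cond_a = conditional_given_sum(Fraction(1, 3))
cond_b = conditional_given_sum(Fraction(3, 4))
print(cond_a == cond_b)  # the conditional distribution does not depend on p
```

Given $t = 1$ the two points $(0, 1)$ and $(1, 0)$ each receive conditional probability $pq / 2pq = 1/2$, whatever the value of $p$.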
The following theorem follows directly from the definition of a sufficient
statistic.

THEOREM 5.1. If $t(x)$ is a sufficient statistic for $\{P_\theta \mid \theta \in \Omega\}$, then $t(x)$ is
sufficient for $\{P_\theta \mid \theta \in \omega\}$ where $\omega$ is a subset of $\Omega$.
In most statistical problems the class of probability measures can be
represented by means of a class of density functions. Such is the case if
the class of measures is dominated.

A class of measures $\{P_\theta \mid \theta \in \Omega\}$ is dominated if there exists a measure $\mu$
such that $P_\theta \ll \mu$ for all $\theta \in \Omega$.

In other words all the measures $P_\theta$ are absolutely continuous with
respect to $\mu$. Applying the Radon-Nikodym theorem, we have the
existence of a function $p_\theta(x)$ such that

    $P_\theta(A) = \int_A p_\theta(x)\, d\mu(x)$

for all $A \in \mathcal{A}$, $\theta \in \Omega$. The function $p_\theta(x)$ is called a probability density
function with respect to $\mu$ and is uniquely determined except on a set having
$\mu$ measure zero. Thus a dominated set of probability measures $\{P_\theta \mid \theta \in \Omega\}$
can equivalently be given by a class of probability densities $\{p_\theta(x) \mid \theta \in \Omega\}$
with respect to a measure $\mu$. The usual discrete and continuous distributions
in $R^n$ are of this type.

In 1935 Neyman formalized a criterion for a sufficient statistic when the
distributions are given by probability density functions. Under restrictive
conditions it was that the density function should factor into two parts, one
depending only on $x$ and the other only on $\theta$ and the statistic. In 1949
Halmos and Savage [6] generalized this criterion in the following theorem.

THEOREM 5.2 (HALMOS and SAVAGE). A necessary and sufficient
condition that $t(x)$ be a sufficient statistic for the class of densities $\{p_\theta(x) \mid \theta \in \Omega\}$
re the measure $\mu$ is that there exist measurable functions $h(x)$, $g_\theta(t)$ such that

(5.5)    $p_\theta(x) = h(x)\, g_\theta(t(x))$

almost everywhere ($\mu$) where $h(x) \ge 0$ and is integrable with respect to $\mu$.

Proof. See Halmos and Savage [6]. This condition for a sufficient
statistic we shall call the Neyman criterion.

As an illustration consider the example involving normal distributions
mentioned earlier in the section. For $X = (X_1, \cdots, X_n)$, where the $X_i$
are independent and each is normal with mean $\xi$ and variance 1, we have
the density function

    $p_\xi(x) = c^n \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - \xi)^2\right]$

    $= c^n \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2\right] \exp\left[-\frac{1}{2} n(\bar{x} - \xi)^2\right].$

Letting

    $h(x) = \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2\right],$

    $g_\xi(\bar{x}) = c^n \exp\left[-\frac{1}{2} n(\bar{x} - \xi)^2\right],$

we see that $p_\xi(x)$ factors as required in the theorem. $h(x)$ is positive but
unfortunately is not integrable as required. However, if we slightly alter
our definitions,

    $h'(x) = \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2\right] \exp\left[-\frac{1}{2} n\bar{x}^2\right],$

    $g'_\xi(\bar{x}) = c^n \exp\left[-\frac{1}{2} n(\bar{x} - \xi)^2\right] \exp\left[+\frac{1}{2} n\bar{x}^2\right],$

it is seen that the conditions of the theorem are fulfilled. Hence $\bar{x}$ is a
sufficient statistic.
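The altered factorization can be checked numerically at a few points. This is a sketch under assumed sample values; the function names `h_prime` and `g_prime` simply mirror the altered factors $h'$ and $g'_\xi$ of the illustration.

```python
import math

C = (2.0 * math.pi) ** -0.5  # the constant c = (2 pi)^(-1/2)


def density(x, xi):
    """Joint normal density p_xi(x) with mean xi and variance 1."""
    return C ** len(x) * math.exp(-0.5 * sum((v - xi) ** 2 for v in x))


def h_prime(x):
    """Factor depending on x only; it equals exp(-sum(x_i^2) / 2)."""
    n = len(x)
    xbar = sum(x) / n
    return (math.exp(-0.5 * sum((v - xbar) ** 2 for v in x))
            * math.exp(-0.5 * n * xbar ** 2))


def g_prime(xbar, xi, n):
    """Factor depending on xi and on x only through the statistic xbar."""
    return (C ** n * math.exp(-0.5 * n * (xbar - xi) ** 2)
            * math.exp(0.5 * n * xbar ** 2))


def factorization_holds(x, xi):
    xbar = sum(x) / len(x)
    product = h_prime(x) * g_prime(xbar, xi, len(x))
    return math.isclose(density(x, xi), product, rel_tol=1e-9)


x = [0.3, -1.2, 2.5, 0.7]  # arbitrary outcome
print(all(factorization_holds(x, xi) for xi in (-1.0, 0.0, 0.4, 2.0)))
```

Since $\sum (x_i - \bar{x})^2 + n\bar{x}^2 = \sum x_i^2$, the factor $h'(x)$ is a multiple of a normal density and is therefore integrable, which is the point of the alteration.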


▶ Other conditions for a sufficient statistic with respect to densities have been
given by Koopman [7]. For these we first describe a homogeneous set of
probability measures. $\{P_\theta \mid \theta \in \Omega\}$ is a homogeneous set of measures if, for every
$\theta, \theta' \in \Omega$, $P_\theta \ll P_{\theta'}$. For a homogeneous set we have $P_\theta \ll P_{\theta'}$ and $P_{\theta'} \ll P_\theta$.
We designate the combination of these statements by $P_\theta \equiv P_{\theta'}$ and say the
measures are equivalent. If we interpret the condition of homogeneity for the
corresponding class of densities, we find that the region $S_\theta$ of positive density,

    $S_\theta = \{x \mid p_\theta(x) > 0\},$

must be independent of $\theta$ a.e. ($\mu$); that is, $\mu(S_\theta - S_{\theta'}) = 0$ for all $\theta, \theta' \in \Omega$.

Koopman [7] proves that, if a homogeneous set of densities over $R^n$ was
obtained from independent and identically distributed random variables and
satisfied certain regularity conditions, then the density function for the component
random variable must factor into terms of an exponential form:

(5.6)    $p_\theta(x) = c(\theta) \exp\left[\sum_{s=1}^{a} a_s(\theta)\, h_s(x) + w(x)\right].$


If the region of positive density $S_\theta$ is not independent of $\theta$, it is possible in
some cases to obtain a sufficient statistic as a combination of a statistic designed
to detect the variation of $S_\theta$ and of a sufficient statistic for a related class of
densities for which the region of positive probability is independent of $\theta$. The
method is described in [8] where a constructive procedure is given for obtaining
a sufficient statistic when all the dependence on $\theta$ is through the set $S_\theta$: that is,
if the density is of the form

    $p_\theta(x) = \phi_{S_\theta}(x)\, g(x)$

where $\phi_{S_\theta}(x)$ is the characteristic function of the set $S_\theta$, and $g(x)$ is independent
of $\theta$. Subject to some regularity conditions, the sufficient statistic is given by

    $t(x) = \bigcap_{S_\theta \ni x} S_\theta$

and its values are subsets of $\mathcal{X}$.
It often happens that two independent random variables are considered
simultaneously. If each random variable has a class of probability measures
and a sufficient statistic, then the following theorem gives a sufficient statistic
for the combined experiment.

THEOREM 5.3. If $X_i$ has distributions $\{P_{\theta_i}^{X_i} \mid \theta_i \in \Omega_i\}$ over $\mathcal{X}_i(\mathcal{A}_i)$ ($i = 1, 2$)
and if $t_i(x_i)$ is sufficient for $\theta_i \in \Omega_i$, then $(t_1(x_1), t_2(x_2))$ is sufficient for the class of
product measures given by $(\theta_1, \theta_2) \in \Omega_1 \times \Omega_2$ (under the assumption that the
conditional probabilities $P_i(A_i \mid t_i)$ are measures over $\mathcal{X}_i$).



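Proof. From the assumption that $t_1(x_1)$, $t_2(x_2)$ are sufficient we have, for all
$A_1, A_2, B_1, B_2$,

    $P_{\theta_1}^{X_1}(A_1 \cap t_1^{-1}(B_1)) = \int_{B_1} P_1(A_1 \mid t_1)\, dP_{\theta_1}^{T_1}(t_1),$

    $P_{\theta_2}^{X_2}(A_2 \cap t_2^{-1}(B_2)) = \int_{B_2} P_2(A_2 \mid t_2)\, dP_{\theta_2}^{T_2}(t_2).$

By assumption $P_1(A_1 \mid t_1)$, $P_2(A_2 \mid t_2)$ are measures over $\mathcal{X}_1$, $\mathcal{X}_2$; let $P(A \mid (t_1, t_2))$
be the product measure over $\mathcal{X}_1 \times \mathcal{X}_2$. Then, if $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$, $t = (t_1, t_2)$,
$\theta = (\theta_1, \theta_2)$, $X = (X_1, X_2)$, we have

(5.7)    $P_\theta^X(A \cap t^{-1}(B)) = \int_B P(A \mid t)\, dP_\theta^T(t)$

for all $A = A_1 \times A_2$ and $B = B_1 \times B_2$. But both sides of the above equation
define, by means of $A$, a product measure on $\mathcal{X}$. The measures are identical
for all product sets $A = A_1 \times A_2$; therefore they agree for any $A$ in the product
space $\sigma$-algebra. This implies that (5.7) holds for all $A \in \mathcal{A}_1 \times \mathcal{A}_2$ and for all
$B = B_1 \times B_2$. A similar argument gives equality for all $B \in \mathcal{B}_1 \times \mathcal{B}_2$ and
establishes that $(t_1, t_2)$ is sufficient for $(\theta_1, \theta_2) \in \Omega_1 \times \Omega_2$.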


EXAMPLE 5.3. Let $X = (X_1, \cdots, X_n)$ where the $X_i$ are independent and each
is normally distributed with mean $\xi$ and variance $\sigma^2$. Similarly let $Y = (Y_1, \cdots, Y_m)$
where the $Y_j$ are independent and each is normally distributed with mean $\eta$
and variance $\tau^2$. By Problem 33 $(\bar{x}, (n - 1)^{-1} \sum (x_i - \bar{x})^2)$ is a sufficient statistic
for $(\xi, \sigma^2)$ and $(\bar{y}, (m - 1)^{-1} \sum (y_j - \bar{y})^2)$ is a sufficient statistic for $(\eta, \tau^2)$.
Then by the above theorem $[\bar{x}, \bar{y}, (n - 1)^{-1} \sum (x_i - \bar{x})^2, (m - 1)^{-1} \sum (y_j - \bar{y})^2]$
is a sufficient statistic for the combined experiment with $\xi, \eta \in\, ]-\infty, +\infty[$ and
$\sigma^2, \tau^2 \in\, ]0, \infty[$.


In problems of estimation and hypothesis testing it often happens that one
parameter in particular is being considered while other "nuisance" parameters
are present. For some of these problems a generalized definition of sufficiency
can be applied. If $X$ has the measures $\{P_{\theta,\eta}^X \mid (\theta, \eta) \in \Theta \times H\}$ over $\mathcal{X}(\mathcal{A})$,
then we define: $t(x)$ is a sufficient statistic ($\theta$) for the family of measures
$\{P_{\theta,\eta}^X \mid (\theta, \eta) \in \Theta \times H\}$ if there exists a function $P_\eta(A \mid t)$ such that

    $P_{\theta,\eta}^X(A \cap t^{-1}(B)) = \int_B P_\eta(A \mid t)\, dP_\theta^T(t)$

for all $A \in \mathcal{A}$, $B \in \mathcal{B}$. The induced measure of $t(x)$, $P_\theta^T(B)$, is independent of $\eta$.
The meaning of this definition is straightforward. The statistic $t(x)$ has a distribution
depending only on $\theta$, the parameter of interest, while the conditional
distribution, given $t(x)$, depends only on the nuisance parameter $\eta$. ◀
6. COMPLETENESS


We have just investigated a property that a statistic may have relative to
a class of probability measures. That property was sufficiency. We now
consider another such property: completeness. Although we shall use the
term completeness for a statistic, actually it is a property of a class of
measures and when applied to a statistic will be in reference to the induced
measures of the statistic.

A family of measures $\{P_\theta^T \mid \theta \in \Omega\}$ is complete if the fulfillment of

(6.1)    $E_\theta\{h(T)\} = \int_{\mathcal{T}} h(t)\, dP_\theta^T(t) = 0$

for all $\theta \in \Omega$ and any real statistic $h(t)$ implies that $h(t) = 0$ almost
everywhere with respect to each of the measures $P_\theta^T$.

In general, the expectation of a statistic will depend on the measure used
in taking the expectation; that is,

(6.2)    $E_\theta\{h(T)\} = g(\theta),$

a function of $\theta$. We say for this equation that $h(t)$ is an unbiased estimate
of $g(\theta)$, but leave until the next chapter a justification for the term unbiased
estimate. Then it is possible to interpret the completeness of the measures
of a random variable by saying that there does not exist an unbiased
estimate $h(t)$ of zero other than the trivial unbiased estimate which is zero
almost everywhere.

We shall say a statistic $t(x)$ is complete relative to the measures
$\{P_\theta^X \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$ if the induced class of measures $\{P_\theta^T \mid \theta \in \Omega\}$ over
$\mathcal{T} = t(\mathcal{X})$ is complete.


EXAMPLE 6.1. Consider the random variable $X = (X_1, \cdots, X_n)$ where
the $X_i$ are independent and each has the normal distribution with mean $\mu$
and variance 1, and let the class of measures over $R^n$ be those obtained from
all values of $\mu$ in $]-\infty, +\infty[$. We demonstrate that the statistic
$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$ is complete. $\bar{x}$ takes its values in $R^1$ and has an induced
distribution which is normal with mean $\mu$ and variance $1/n$. Thus we
wish to show that the class of densities

    $\left\{ \left(\frac{n}{2\pi}\right)^{1/2} \exp\left[-\frac{n}{2}(y - \mu)^2\right] \,\Big|\, \mu \in\, ]-\infty, +\infty[ \right\}$

is complete. To do this we consider a statistic $h(y)$ which has expectation zero
for all $\mu$:

    $\int_{-\infty}^{\infty} h(y) \left(\frac{n}{2\pi}\right)^{1/2} \exp\left[-\frac{n}{2}(y - \mu)^2\right] dy = 0,$

or, by expanding $(y - \mu)^2$ and removing the nonzero factor $(n/2\pi)^{1/2} \exp[-n\mu^2/2]$, which does not depend on $y$,

    $\int_{-\infty}^{\infty} h(y) \exp\left[-\frac{n}{2} y^2\right] \exp[ny\mu]\, dy = 0.$

By letting $n\mu = v$, we have

    $\int_{-\infty}^{\infty} h(y) \exp\left[-\frac{n}{2} y^2\right] \exp[vy]\, dy = 0,$

and this equation states that the Laplace transform of the function

    $h(y) \exp\left(-\frac{n}{2} y^2\right)$

is zero identically. But the function 0 also has the transform which is
zero identically. Hence, by the uniqueness property of Laplace transforms
it follows that

    $h(y) \exp\left(-\frac{n}{2} y^2\right) = 0,$

and hence $h(y) = 0$, almost everywhere (Lebesgue). We have proved that
our statistic with zero expectation is zero almost everywhere. This
establishes the completeness of the class of densities for $\bar{x}$, and we say that
$\bar{x}$ is a complete statistic.

In the theory of hypothesis testing a modification of the above definition
is useful.

A class of measures $\{P_\theta^T \mid \theta \in \Omega\}$ is boundedly complete if

    $E_\theta\{h(T)\} = \int_{\mathcal{T}} h(t)\, dP_\theta^T(t) = 0$

for all $\theta \in \Omega$, and any real statistic $h(t)$ satisfying

    $|h(t)| < M$

implies that $h(t) = 0$ almost everywhere with respect to each measure $P_\theta^T$.
In words the condition is the nonexistence of a bounded real statistic having
zero expectation other than the trivial statistic which is zero almost
everywhere.

We now have a simple theorem concerning completeness and bounded
completeness.

THEOREM 6.1. If a class of measures is complete, then it is boundedly
complete.

Proof. Given completeness, (6.1) is sufficient to prove $h(t) = 0$ almost
everywhere ($P_\theta^T$). Then certainly (6.1) plus a condition of boundedness
will prove the same.

However, the converse to this theorem is not necessarily true. This is
illustrated by the following example given by Girshick, Mosteller and
Savage [9].


EXAMPLE 6.2. Let $T$ be a random variable standing for the measure
which assigns probability $q, p^2, p^2 q, \cdots, p^2 q^i, \cdots$ respectively to the
integers $0, 1, 2, \cdots, i + 1, \cdots$, where $q = 1 - p$, and consider the class of
measures obtained by letting $p$ range over $]0, 1[$. We show that this class
is boundedly complete but not complete.

A statistic with zero expectation for all the measures in the class will
satisfy

    $f(0)\, q + f(1)\, p^2 + f(2)\, p^2 q + \cdots = 0$

for all $p \in\, ]0, 1[$. Now by rearrangement we obtain

    $f(1) + f(2)\, q + f(3)\, q^2 + \cdots = -f(0)\, q p^{-2}$
    $= -f(0)\, q (1 - q)^{-2}$
    $= -f(0)\, (q + 2q^2 + 3q^3 + \cdots).$

Thus we have two power series which are identical for $q \in\, ]0, 1[$. It
follows then that the corresponding coefficients must be equal:

    $f(1) = 0,$
    $f(2) = -f(0),$
    $f(i) = -(i - 1) f(0).$

This determines the form of any unbiased estimate of zero. If $f(0) = 0$,
then $f(t) = 0$ at all non-negative integers; otherwise the function $f(t)$ is
unbounded. Thus there are nondegenerate unbiased estimates of zero,
but there are none that are bounded. We conclude that the class of
measures is boundedly complete but not complete.
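The unbiased estimate of zero found in Example 6.2 can be checked numerically by summing the series for several values of $p$. The sketch below is illustrative only: the function name, the choice $f(0) = 1$, and the truncation point of the series are assumptions.

```python
def expectation_of_f(p, f0=1.0, terms=5000):
    """Truncated series for E{f(T)} with f(t) = -(t - 1) f(0) for t >= 1.

    T takes the value 0 with probability q = 1 - p and the value t >= 1
    with probability p^2 q^(t - 1).
    """
    q = 1.0 - p
    total = f0 * q                       # contribution of t = 0
    for t in range(1, terms + 1):
        prob = p * p * q ** (t - 1)      # P(T = t) for t >= 1
        total += -(t - 1) * f0 * prob    # f(t) = -(t - 1) f(0)
    return total


for p in (0.1, 0.3, 0.5, 0.9):
    print(p, expectation_of_f(p))        # each value is zero up to rounding
```

The statistic $f$ grows linearly in $t$, so it has zero expectation for every $p$ without being bounded, in agreement with the conclusion of the example.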

In the last section we found that, if a statistic was sufficient for a class of
measures, then it was sufficient for any subclass of those measures. The
property of completeness works somewhat in the opposite direction: if we
have completeness for a class of measures we can sometimes infer completeness
for a larger class.

THEOREM 6.2. The completeness of $\{P_\theta^T \mid \theta \in \Omega\}$ implies the completeness
of $\{P_\theta^T \mid \theta \in \Omega^*\}$ if $\Omega$ is a subset of $\Omega^*$ and if none of the added
measures assign positive probability to sets having zero probability for
each measure of $\Omega$. This second condition implies that almost everywhere
$\{P_\theta^T \mid \theta \in \Omega\}$ is equivalent to almost everywhere $\{P_\theta^T \mid \theta \in \Omega^*\}$.

Proof. We first prove the last statement in the theorem. The condition
obviously implies that any set having zero measure for each $P_\theta^T$, $\theta \in \Omega$,
also has zero measure for each $P_\theta^T$, $\theta \in \Omega^*$; the converse is trivial. But this
is just another way of stating the equivalence in the last sentence of
the theorem.

By examining the definition of completeness we see that it is necessary to
show that a function $h(t)$ which satisfies certain conditions is zero almost
everywhere with respect to a class of distributions. Introducing more
distributions imposes more conditions. The original conditions ($\Omega$) were
sufficient to prove $h(t)$ zero almost everywhere $\{P_\theta^T \mid \theta \in \Omega\}$. But, since
almost everywhere $\{P_\theta^T \mid \theta \in \Omega\}$ is equivalent to almost everywhere
$\{P_\theta^T \mid \theta \in \Omega^*\}$, we have completeness for the larger class, and hence the
theorem is proved.


▶ Completeness of a class of product measures over a product space can
sometimes be obtained from completeness over the component spaces. For
this we need a modification of the concept of completeness:

The measures $\{P_\eta^T \mid \eta \in H\}$ are strongly complete with respect to a measure
$m$ on $H$ if, for any subset $H^*$ of $H$ for which $m(H - H^*) = 0$, the condition

(6.3)    $E_\eta\{h(T)\} = 0$

for all $\eta \in H^*$, and any real statistic $h(t)$ implies that $h(t) = 0$ almost
everywhere $\{P_\eta^T \mid \eta \in H\}$.

This is a stronger property than completeness; it requires that any unbiased
estimate of zero for a subclass of the measures is necessarily zero almost everywhere
with respect to the full class, provided the measures omitted form a set
having $m$ measure zero.


THEOREM 6.3. If the random variable $T$ is complete re $\{P_\theta^T \mid \theta \in \Omega\}$ over
$\mathcal{T}(\mathcal{B})$, and the random variable $T'$ is strongly complete re $\{P_\eta^{T'} \mid \eta \in H\}$ over
$\mathcal{T}'(\mathcal{B}')$, then $(T, T')$ is complete re the class of product measures
$\{P_\theta^T \times P_\eta^{T'} \mid (\theta, \eta) \in \Omega \times H\}$ over $\mathcal{T} \times \mathcal{T}'$.

Proof. It is necessary to prove that the condition

(6.4)    $\int_{\mathcal{T} \times \mathcal{T}'} h(t, t')\, dP_\theta^T(t)\, dP_\eta^{T'}(t') = 0$

for all $(\theta, \eta)$ implies that $h(t, t') = 0$ almost everywhere
$\{P_\theta^T \times P_\eta^{T'} \mid (\theta, \eta) \in \Omega \times H\}$. Defining $g_\eta(t)$ by

    $g_\eta(t) = \int_{\mathcal{T}'} h(t, t')\, dP_\eta^{T'}(t')$

and applying Fubini's theorem concerning the interchange of order of integration,
we obtain

    $\int_{\mathcal{T}} g_\eta(t)\, dP_\theta^T(t) = 0$

for all $\theta \in \Omega$. But from the completeness of $\{P_\theta^T\}$ we have that for each $\eta$, $g_\eta(t) = 0$
almost everywhere $\{P_\theta^T\}$. Now, if we treat $g_\eta(t)$ as a function of $\eta$ and $t$
over the product space $H \times \mathcal{T}$, then it follows that, for almost all $\{P_\theta^T\}$ values
of $t$, $g_\eta(t) = 0$ for almost all ($m$) values of $\eta$. But, using the form of $g_\eta(t)$ and
the strong completeness of $\{P_\eta^{T'}\}$, we obtain that, for almost all $\{P_\theta^T\}$ values of $t$
and for almost all $\{P_\eta^{T'}\}$ values of $t'$, $h(t, t') = 0$. This is what we set out to
prove. ◀
7. TWO EXAMPLES OF COMPLETENESS 


In Problem 29 a statistic of particular interest in nonparametric theory
is introduced. In that problem the sample space is $R^n$, and an outcome
is a point $x = (x_1, \cdots, x_n)$ in $R^n$. The statistic is $t(x) = (x_{(1)}, \cdots, x_{(n)})$,
where $x_{(1)}, \cdots, x_{(n)}$ are the numbers $x_1, \cdots, x_n$ arranged in order of
magnitude from smallest to largest: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Thus the
statistic gives the magnitude of the numbers $x_1, \cdots, x_n$ but not the order in
which they occur. This statistic has been called in the literature the "order
statistics"; we shall call it the order statistic in accordance with the
general definition of statistic.

If we consider this statistic in terms of a partition of the sample space, we
can let $t(x)$ stand for the set of points formed by $(x_1, \cdots, x_n)$ and all the
points obtained by permuting the coordinates $x_1, \cdots, x_n$. It is easily seen
that the set $t(x)$ contains at most $n!$ points. Alternatively the order
statistic can be given as a mapping from the point $(x_1, \cdots, x_n)$ to the set
whose elements are $x_1, \cdots, x_n$, that is, the set $\{x_1, \cdots, x_n\}$. It is easily seen
that this is equivalent to the above definitions by examining what points
are mapped into a given set $\{x_1, \cdots, x_n\}$. Obviously the points are those
obtained by permuting the numbers $x_1, \cdots, x_n$ and using the resulting
permutation as a set of coordinates for a point.
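The two descriptions of the order statistic, as the sorted vector and as the set of points sharing the same coordinates up to permutation, can be compared directly. The sketch below is illustrative; the function names `order_statistic` and `fiber` are not from the text.

```python
from itertools import permutations


def order_statistic(x):
    """The coordinates of x arranged from smallest to largest."""
    return tuple(sorted(x))


def fiber(x):
    """All points obtained by permuting the coordinates of x.

    These are exactly the points mapped to the same value of the
    order statistic; there are at most n! of them.
    """
    return set(permutations(x))


x = (3, 1, 2)
print(order_statistic(x))          # the ordered coordinates
print(len(fiber(x)))               # 3! points when the coordinates are distinct
print(len(fiber((1, 1, 2))))       # fewer than 3! points when there are ties
print(all(order_statistic(p) == order_statistic(x) for p in fiber(x)))
```

With tied coordinates some permutations coincide, which is why the text says "at most" $n!$ points.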

In some nonparametric problems the probability measures under a
hypothesis are symmetric in the coordinates $x_1, \cdots, x_n$. Each permutation
of $(x_1, \cdots, x_n)$ thus bears the same relation to the problem as any other
permutation. Under an alternative hypothesis there may be asymmetry.
In such a situation the original ordering of the numbers $x_1, \cdots, x_n$ before
it was lost in calculating the statistic $t(x)$ is all important. Here the
statistic does not extract all the important information contained in the
outcome, but nevertheless the statistic is useful because for some of these
problems, as we shall find in a later chapter, the statistician constructs his
test in the subspace given $t(x)$ as if it were the sample space for the problem.

Let $X_1, \cdots, X_n$ be independent, and let each $X_i$ have the same distribution
function $F(x)$ on the real line. Problems 29 and 30 are to prove that
the order statistic $t(x)$ forms a sufficient statistic for the class obtained
by considering all absolutely continuous distributions $F(x)$. In a later
chapter we shall show that $t(x)$ is sufficient for the class obtained by
considering all distributions $F(x)$ on the real line. By Theorem 5.1 $t(x)$ is
then sufficient corresponding to any class of distributions on the real line.
The order statistic also has the property of completeness provided the
class of distributions is not too small. In this section we prove the order
statistic complete corresponding to the absolutely continuous distributions
on $R^1$ and corresponding to the discrete distributions on $R^1$. For this we
need a lemma on homogeneous polynomials.


LEMMA 7.1 (HALMOS). If $Q(p_1, \cdots, p_n)$ is a homogeneous polynomial
of degree greater than 0 and satisfying $Q(p_1, \cdots, p_n) = 0$ whenever
$0 \le p_i \le 1$ ($i = 1, \cdots, n$) and $\sum_{i=1}^{n} p_i = 1$, then $Q(p_1, \cdots, p_n)$ is zero
identically.

Proof. When $n = 1$, the lemma is trivial. The proof follows by
induction, assuming that it holds for $n - 1$ and proving for $n$.

If we replace each $p_i$ by $cp_i$ then from homogeneity a power of $c$ will
factor out leaving the original polynomial. Therefore the restrictions
$0 \le p_i \le 1$ and $\sum p_i = 1$ may be replaced by the restrictions $0 \le p_i$
($i = 1, \cdots, n$).

If we write $Q(p_1, \cdots, p_n)$ as a polynomial in $p_n$, we have for given
$p_1, \cdots, p_{n-1}$ that it is identically zero for all $p_n > 0$. Hence the coefficients
of the different powers of $p_n$ must be zero. But, since these coefficients are
homogeneous functions of $n - 1$ variables, the lemma follows by induction.

Let $F(x)$ define a distribution which assigns probability $p_1, p_2, \cdots, p_n$,
respectively, to the disjoint intervals $I_1, I_2, \cdots, I_n$ on the real line ($\sum p_i = 1$).
Within each we assume that the distribution is uniform; that is, we assume
that the distribution has a density function which is constant-valued
within each interval. We call such a distribution uniform within intervals.


THEOREM 7.1. The order statistic $t(x)$ is complete for the class of
distributions over $R^n$ corresponding to each coordinate having the same
distribution function $F(x)$ which is any distribution uniform within
intervals.


Proof. We need to show that any real function $h(t(x))$ of the order
statistic satisfying

(7.1)    $E\{h(t(X))\} = 0$

for all the given distributions is necessarily zero almost everywhere with
respect to Lebesgue measure. First we find a convenient way of expressing
a function of the order statistic $t(x)$. Obviously, any function of $t(x)$ is a
symmetric function of the $x_1, \cdots, x_n$. Conversely, any symmetric function
is a function of the $x_i$'s which does not depend on the order in which they
are inserted into the function and hence is a function of the set $\{x_1, \cdots, x_n\}$:
that is, is a function of $t(x)$. We therefore consider any symmetric
function $h(x_1, \cdots, x_n)$ having zero expectation, and prove that it is zero
almost everywhere. We have

(7.2)    $0 = E\{h(X_1, \cdots, X_n)\} = \sum_{i_1=1}^{n} \cdots \sum_{i_n=1}^{n} p_{i_1} \cdots p_{i_n}\, J(i_1, \cdots, i_n),$

where

(7.3)    $J(i_1, \cdots, i_n) = [l(i_1) \cdots l(i_n)]^{-1} \int_{I_{i_1}} \cdots \int_{I_{i_n}} h(x_1, \cdots, x_n) \prod_{j=1}^{n} dx_j$

and $l(1), \cdots, l(n)$ are the lengths, respectively, of the intervals $I_1, \cdots, I_n$.
Since $h(x_1, \cdots, x_n)$ is symmetric, so also is $J(i_1, \cdots, i_n)$. Therefore (7.2)
can be written

(7.4)    $0 = \sum p_1^{a_1} \cdots p_n^{a_n}\, c(a_1, \cdots, a_n),$

where the summation is over all non-negative integers $a_1, \cdots, a_n$ satisfying
$\sum a_i = n$ and where $c(a_1, \cdots, a_n)$ is an integral multiple of the $J(i_1, \cdots, i_n)$
having $a_1$ of the $i_j$'s equal 1, $a_2$ of the $i_j$'s equal 2, and so on.
The expression on the right-hand side of (7.4) satisfies the conditions of
Lemma 7.1. Therefore,

    $c(a_1, \cdots, a_n) = 0,$

and hence

    $J(i_1, \cdots, i_n) = 0.$

It follows then that

(7.5)    $\int_{I_{i_1}} \cdots \int_{I_{i_n}} h(x_1, \cdots, x_n) \prod_{j=1}^{n} dx_j = 0$

for all $i_1, \cdots, i_n$ and all disjoint intervals $I_1, \cdots, I_n$.
The expression (7.5) determines a measure for all product sets $I_{i_1} \times \cdots \times I_{i_n}$
and this measure is zero. An extension of the measure (7.5) to all Borel
sets is given by

    $\int \cdots \int_A h(x_1, \cdots, x_n) \prod_{j=1}^{n} dx_j$

as determined by the left-hand side and by

    $\int \cdots \int_A 0 \prod_{j=1}^{n} dx_j$

as determined by the right-hand side. But, since the two extensions of the
measure must be identical, we have by the Radon-Nikodym Theorem 4.1
that $h(x_1, \cdots, x_n) = 0$ almost everywhere. This completes the proof.

Heretofore we have extended measures and used the Radon-Nikodym
theorem only for positive measures. Actually we need the theory for
measures which may take positive and negative values. For references on
this account see Halmos [2].


COROLLARY. The order statistic $t(x)$ is complete for the class of
distributions over $R^n$ corresponding to each coordinate having the same
distribution function $F(x)$ which is any absolutely continuous distribution.

Proof. By Theorem 6.2 we have completeness for any class of absolutely
continuous distributions for $F(x)$, provided these contain the distributions
uniform within intervals.


> The theorem we have just proved is a particular case of a theorem which we 
shall give below without proof. The proof may be found in [10], together with 
a more general version covering the combination of a number of order statistics 
from separate experiments. 
We first define some necessary concepts. Let $\mathcal{X}(\mathcal{A})$ be a measurable space.
A class $\mathcal{T}$ of subsets of $\mathcal{X}$ is a ring if the conditions
(1) If $A_1, A_2 \in \mathcal{T}$, then $A_1 \cup A_2 \in \mathcal{T}$;
(2) If $A_1, A_2 \in \mathcal{T}$, then $A_1 - A_2 \in \mathcal{T}$;
are satisfied. If (1) is replaced by
(1') If $A_1, A_2, \dots \in \mathcal{T}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{T}$,
then of course the ring is a σ-ring. We shall say that a ring $\mathcal{T}$ is a basis for the
σ-algebra $\mathcal{A}$ and write $\mathcal{T} = \mathrm{Bas}(\mathcal{A})$ if the smallest σ-ring containing $\mathcal{T}$ is
the σ-algebra $\mathcal{A}$. Thus $\mathcal{T}$ generates $\mathcal{A}$ under the operations of taking countable
union and difference.


THEOREM 7.2. If $\mu(A)$ is a given nonatomic measure over $\mathcal{X}(\mathcal{A})$, if $P_B$ is
the measure of a uniform distribution over the set $B$ with respect to the measure
$\mu$, and if $\{P_B^n \mid B \in \mathrm{Bas}(\mathcal{A})\}$ is the class of power-product measures $P_B^n$ over
$\mathcal{X}^n$ for the sets $B$ of a basis, then $t(x_1, \dots, x_n) = \{x_1, \dots, x_n\}$ is a complete
statistic.
We interpret some of the statements in the theorem. $P_B$ will have the form

$P_B(A) = \int_A p(x) \, d\mu(x),$

where

$p(x) = c$  if $x \in B$
$\quad\;\, = 0$  if $x \notin B$.

If $P_B$ is a probability measure, $c = 1/\mu(B)$. A measure $\mu(A)$ is nonatomic if there
does not exist a set $A$ having $\mu(A) \ne 0$ and such that, for any $C \subset A$, either
$\mu(C) = 0$ or $\mu(C) = \mu(A)$. If such a set existed, it would be called an atom.
The power-product measure $P_B^n$ is the measure over $\mathcal{X}^n$ obtained from the
measures $P_B$ for each coordinate combined according to "independence" as
given in Section 3.
The order statistic is also complete for sampling from suitable classes of
discrete distributions. We have an analog of Theorem 7.1.


THEOREM 7.3. (HALMOS). The order statistic $t(x)$ is complete for the class of
distributions over $R^n$ corresponding to each coordinate having the same
distribution function $F(x)$ which is any discrete distribution on the points of any set $B$.


Proof. There are no restrictions on the set $B$. Of course, any discrete
distribution can have probability on at most a countable number of the points of $B$.
The proof follows in the pattern of that for Theorem 7.1 with the intervals
replaced by points. If it should happen that $B$ contains fewer than $n$ points,
the argument remains valid since the only restriction in Lemma 7.1 is that the
degree of the polynomial be greater than zero. <


8. PROBLEMS FOR SOLUTION 
1. If $\mathcal{A}$ is a σ-algebra, show that, if $A_1, A_2, \dots \in \mathcal{A}$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{A}$.

2. If $\mathcal{A}$ is a σ-algebra, show that $\mathcal{X} \in \mathcal{A}$.

3. If $\mathcal{A}_\alpha$ is any group of σ-algebras on the space $\mathcal{X}$, show that $\bigcap_\alpha \mathcal{A}_\alpha$ is also a
σ-algebra on $\mathcal{X}$. Of course, $\bigcap_\alpha \mathcal{A}_\alpha$ consists of sets $A$ which belong to each of the
σ-algebras $\mathcal{A}_\alpha$. Hence $\bigcap_\alpha \mathcal{A}_\alpha \subset \mathcal{A}_\alpha$ for each $\alpha$, and we say that $\bigcap_\alpha \mathcal{A}_\alpha$ is a
smaller σ-algebra than $\mathcal{A}_\alpha$.

4. On the product space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ is there necessarily any σ-algebra
containing all the product sets $A_1 \times \cdots \times A_n$ where $A_i \in \mathcal{A}_i$? For this show that the
class of all subsets of a space $\mathcal{X}$ is a σ-algebra.

5. Use Problems 3 and 4 to prove the existence of the natural σ-algebra on a product
space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$.

6. Prove that relations (2.3) and (2.4) hold for $\alpha$ ranging over any index set $J$.

7. Prove that $\mathcal{B}^*$ defined by (2.2) is a σ-algebra on $\mathcal{T}$.

8. Prove that $\mathcal{B}^*$ defined by (2.2) is the largest σ-algebra for which $t(x)$ is measurable.

9. Show that $\mathcal{B} \subset \mathcal{B}^*$ defined by (2.2) is a necessary and sufficient condition for $t(x)$
to be measurable from $\mathcal{X}(\mathcal{A})$ to $\mathcal{T}(\mathcal{B})$.

10. Define the partition of $\mathcal{X}$ induced by a statistic $t(x)$ from $\mathcal{X}(\mathcal{A})$ to $\mathcal{T}(\mathcal{B})$.

11. Prove that the four axioms for a probability measure hold exactly for the 
frequency ratio. 

12. Prove that the function $F(x_1, \dots, x_n)$ defined in Section 3 satisfies the three
conditions, (i)*, (ii)*, (iii)*, for a distribution function.

13. In the notation of Section 3 prove that

$P(\,]a_1, b_1] \times \cdots \times ]a_n, b_n]\,) = \Delta_{x_1}(a_1, b_1) \cdots \Delta_{x_n}(a_n, b_n) F(x_1, \dots, x_n).$

14. Show that the induced "measure" $Q(B)$ defined by (3.3) satisfies the four
conditions for a probability measure.

15. For the definition of the integral show that the limit (3.1) exists.

16. If $f(x)$, $g(x)$ are real-valued measurable functions over $\mathcal{X}(\mathcal{A})$, prove that
$f(x) + g(x)$ and $f(x) g(x)$ are also measurable.

17. Prove that

$\int_A a f(x) \, d\mu(x) = a \int_A f(x) \, d\mu(x).$

18. Prove that

$\int_A [f(x) + g(x)] \, d\mu(x) = \int_A f(x) \, d\mu(x) + \int_A g(x) \, d\mu(x).$

19. Prove that, if $f(x) \ge 0$,

$\int_A f(x) \, d\mu(x) \ge 0.$

20. Prove that, if $f(x) \ge g(x)$,

$\int_A f(x) \, d\mu(x) \ge \int_A g(x) \, d\mu(x).$

21. Prove that

$\int_A |f(x) + g(x)| \, d\mu(x) \le \int_A |f(x)| \, d\mu(x) + \int_A |g(x)| \, d\mu(x).$

22. Prove that, if $a \le f(x) \le b$ on the set $A$,

$a\mu(A) \le \int_A f(x) \, d\mu(x) \le b\mu(A).$

23. If $g(x)$ is a statistic from $\mathcal{X}(\mathcal{A})$ to $\mathcal{Y}(\mathcal{B})$ and $f(y)$ is a real-valued statistic over
$\mathcal{Y}(\mathcal{B})$, show that

$E\{f(Y)\} = E\{f(g(X))\},$

where $Y = g(X)$ and $X$ stands for the measure $P(A)$ over $\mathcal{X}(\mathcal{A})$. Hence establish the
uniqueness of the definition in formula (3.4).

24. Prove that the binomial distribution has a probability measure which is absolutely
continuous with respect to the Poisson distribution; which is not absolutely continuous
with respect to the normal. Show that the function $N(A)$ = number of non-negative
integers in $A$ is a measure. Express the binomial distribution by means of a density
with respect to $N(A)$. Express the binomial by means of a density with respect to the
Poisson measure.

25. Prove that all probability measures given by a simple density function

$P(A) = \int_A f(x) \, dx$

are absolutely continuous with respect to Lebesgue measure.

26. Let $X_i = 1, 0$ with probabilities, respectively, $p$, $q = 1 - p$, and let $X_1, \dots, X_n$ be
independent. Find the conditional probability measure, given $t(x) = \sum_{i=1}^{n} x_i$.

27. Is the class of all probability measures on $R^1$ dominated?

28. Prove that, if $P_X(A \mid t)$ is a determination of conditional probability, given
$t(x) = t$, then (i) $P_X(\mathcal{X} \mid t) = 1$ a.e. $(P_T)$; (ii) $0 \le P_X(A \mid t) \le 1$ a.e. $(P_T)$; (iii) for
$A_1, A_2, \dots$ disjoint,

$\sum_{i=1}^{\infty} P_X(A_i \mid t) = P_X\left(\bigcup_{i=1}^{\infty} A_i \mid t\right)$ a.e. $(P_T)$.

29. Let $X$ be a random variable with the distribution function $F(x) = \int_{-\infty}^{x} f(x) \, dx$.
$f(x)$ is called the probability density function with respect to Lebesgue measure, and
by the Radon-Nikodym theorem the probability measure of $X$ is absolutely continuous
with respect to Lebesgue measure. If $(X_1, \dots, X_n)$ has the distribution function
$\prod_{i=1}^{n} F(x_i)$ and density function $\prod_{i=1}^{n} f(x_i)$ over $R^n$, then prove that
$(X_1, \dots, X_n)$ has the product measure of the measure of $X$ for each coordinate: the
power-product measure. Consider a statistic $t(x) = (x_{(1)}, \dots, x_{(n)})$, called the order
statistic, where $x_{(1)}, \dots, x_{(n)}$ are the values $x_1, \dots, x_n$ arranged in order of
magnitude: $x_{(1)} \le \cdots \le x_{(n)}$. Show that a determination of the conditional
distribution is

$P(A \mid t) = \dfrac{i(A, t)}{n!} = \dfrac{\sum_P \phi_A(t_P)}{n!},$

where $i(A, t)$ is the number of permutations of $(x_{(1)}, \dots, x_{(n)})$ that belong to $A$,
$\phi_A(x)$ is the characteristic function of $A \subset R^n$, $\sum_P$ denotes summation over all the $n!$
permutations $t_P$, and $t_P$ is a typical permutation of $(x_{(1)}, \dots, x_{(n)})$.

30. For the class of distributions of the form given in Problem 29, use the Neyman
criterion to prove that the order statistic is a sufficient statistic.

31. Let $X_i = 1, 0$ with probability $p$, $q = 1 - p$, and let $X_1, \dots, X_n$ be independent.
Find the conditional measure given $t(x) = \sum_{i=1}^{n} x_i$.

32. Show that $t(x)$ in Problem 31 is sufficient for $p \in \,]0, 1[$.

33. Let $X_1, \dots, X_n$ be independent, and let each be normally distributed with mean $\xi$
and variance $\sigma^2$. Use the Neyman criterion to show that $(\bar{x}, \sum_{i=1}^{n} (x_i - \bar{x})^2)$ is a
sufficient statistic for $\xi \in \,]-\infty, +\infty[$ and $\sigma^2 \in \,]0, \infty[$.

34. For the distributions in Problem 33 with $\xi = 0$ and $\sigma^2 \in \,]0, \infty[$, show that $\sum x_i^2$
is a sufficient statistic.

35. If $X_1, \dots, X_n$ are independent, and each has the Poisson distribution with mean
$m$, show that $\sum_{1}^{n} x_i$ is a sufficient statistic for $m \in \,]0, \infty[$.

36. If $X = (X_1, \dots, X_N)$ are independent, and each has the binomial distribution
with parameters $n$, $p$, show that $\sum x_i$ is a sufficient statistic for given $n$ and $p \in \,]0, 1[$.
37. If X = (X,, X) has a multivariate normal distribution over R? with zero means 
and covariance matrix 
+7 or 


> 


-r pr 


show that t(x) = 2, + z, is a sufficient statistic (o*) for the family of distributions 
(a°, 7°) E ]0, œf. 

38. If $X_1, \dots, X_n$ are independent, and each is normally distributed with mean $\xi$
and variance $\sigma^2$, show that $t(x) = \sum x_i^2$ is a complete statistic relative to the class of
distributions corresponding to $\xi = 0$ and $\sigma^2 \in \,]0, \infty[$.

39. For Problem 38 show that $t(x) = (\bar{x}, \sum (x_i - \bar{x})^2)$ is complete corresponding to
$(\xi, \sigma^2) \in R^1 \times \,]0, \infty[$.

40. Show that the Poisson distributions $m \in \,]0, \infty[$ form a complete class. Hence
show that $\sum x_i$ is a complete statistic corresponding to the distributions in Problem 35
with $m \in \,]0, \infty[$.

41. Show that the binomial distributions form a complete class, $n$ fixed and $p \in \,]0, 1[$.
Show that $\sum x_i$ is a complete statistic corresponding to the distributions in Problem 36
with $p \in \,]0, 1[$.

42. The hypergeometric distribution arises in acceptance sampling. Consider $N$
objects, $D$ with a certain property and $N - D$ without that property. If each selection
of $n$ from the $N$ objects has the same probability $\binom{N}{n}^{-1}$ of being chosen in a sampling
procedure, then the probability distribution for the number $Y$ of objects with the
property in the sample is given by

$\Pr(Y = y) = \dfrac{\binom{D}{y} \binom{N - D}{n - y}}{\binom{N}{n}}$

if $y$ is an integer between 0 and $\min\{D, n\}$, inclusive. Show that the probability measures
corresponding to $D = 0, 1, \dots, N$ form a complete class.

43. Consider the class of absolutely continuous distributions over $]-\infty, 0[ \, \times \, ]0, \infty[$
(the second quadrant in $R^2$). If $(X_1, Y_1), \dots, (X_n, Y_n)$ are independent, and each has
the same distribution in the class above, then show that $t(x, y) = \{(x_1, y_1), \dots, (x_n, y_n)\}$
is a sufficient statistic. Use Theorem 7.2 to show that $t(x, y)$ is complete corresponding to
the class of absolutely continuous distributions.

44. Prove Theorem 7.3.


CHAPTER 2


Statistical Inference 


1. THE DECISION PROBLEM 


In Chapter 1 we introduced the concept of probability and considered 
the recording of the outcome of an experiment in a sample space, the 
frequency interpretation of probability in a sequence of repetitions of an 
experiment, and the summarizing of the outcome by means of a statistic. 
Also we considered two properties, sufficiency and completeness, which a 
statistic may possess in relation to a class of probability measures. These 
two properties will find considerable application in the later sections. 

In this section we shall sketch the model of decision theory. The 
different branches of statistical inference, estimation, hypothesis testing, 
confidence regions and tolerance regions, can all be considered as examples 
of decision theory. The present developments in nonparametric theory 
are rather special to the branches in which they occur, and seemingly the 
general theory has less to offer here than is the case for other parts of 
statistics. Consequently, we shall spend only a few pages formulating the 
general model and then in the later sections of the chapter develop the four 
branches as much in the style of the general theory as is possible. 

Suppose the statistician has chosen the sample space and the class of
probability measures $\{P_\theta \mid \theta \in \Omega\}$ suitable to the experiment. By examining
the experimental arrangement and the purposes for which the experiment
was designed, he can consider the different decisions he may wish to make
upon the completion of the experiment. This aggregation of decisions we
call the decision space and designate it by $\mathcal{D}$; a particular decision we
designate by $d$. The decision space $\mathcal{D}$ will usually bear some simple
relation to the space of probability measures $\Omega$; for the purpose of a
decision is to try to say something about the unknown situation confronting
the statistician, and as far as the model is concerned this situation is given
by the parameter $\theta$. Thus in effect the statistician wishes to say something
about the probability measure which produces the outcome he observes.

If the statistician has a plan which tells him what decision to make for
each possible outcome of the experiment, he has what we call a decision
function $d(x)$ which for each outcome $x$ of the experiment prescribes a
decision $d(x)$ in $\mathcal{D}$. A decision function is thus a statistic with values in
$\mathcal{D}$. Once he has decided on a decision function, he has a definite plan of
action to follow as soon as the outcome of the experiment is observed.
The statistician's objective in any experiment is to find a decision function
$d(x)$ which is in some sense good or best.

The full aggregation of decision functions would be obtained by
considering all possible statistics from $\mathcal{X}$ into $\mathcal{D}$. However, in many
cases the statistician is willing to restrict his attention to some subclass of
all decision functions. He may do this for various reasons: the mathematical
complexity of the full class, the physical difficulty of applying a very
abstract procedure, or practical expediency. The class of decision functions
to which the statistician restricts his attention we designate by $\mathcal{D}_0$.

In some situations it is to the statistician's advantage to take the outcome
of an experiment and randomly choose a decision from $\mathcal{D}$, the
probabilities for the different decisions being dependent on the outcome $x$.
We now formulate this notion and then later in the section indicate the
advantages by a simple example. In order to apply probabilities to $\mathcal{D}$,
we need a σ-algebra of subsets, say $\mathcal{F}$ over $\mathcal{D}$; then $\mathcal{D}(\mathcal{F})$ is a measurable
space. We now consider the class of probability measures over
$\mathcal{D}(\mathcal{F})$, designating it by $\mathcal{M}$ and a typical measure by $m(D)$. A
randomized decision function $m_x$ is a function from $\mathcal{X}$ to $\mathcal{M}$, that is, a
statistic with values in $\mathcal{M}$. To use a decision function $m_x$ the statistician
obtains the outcome $x$ from his experiment, chooses the probability
measure $m_x(D)$ given by the decision function, and constructs a random
experiment which applies probability to $\mathcal{D}$ in accord with $m_x(D)$. From
this constructed experiment he chooses an outcome, and this outcome, an
element of $\mathcal{D}$, is his final decision. We designate by $\mathcal{M}_0$ the class of
randomized decision functions to which the statistician restricts his attention.

Once the statistician has set up a model for his experiment, his main
problem is to choose a decision function that is in some sense good or best.
However, before he can judge decision functions, he must have some idea of
the relative merits of the different decisions $d$ for each situation $\theta$ in which
he might find himself. Accordingly, we suppose the statistician can
measure the loss, perhaps economic, that he suffers in making a decision
$d$ in the situation $\theta$. He thus obtains a loss function $W(d, \theta)$, and we
require that $W(d, \theta) \ge 0$. Naturally, if $d$ were the "correct" decision for
some situation $\theta$, we would expect the statistician not to suffer any loss;
i.e. $W(d, \theta) = 0$.

The statistician can now examine the effect of a decision function $d(x)$ in
the various situations $\theta$. Unfortunately the loss he will suffer depends not
only on $\theta$ and the particular function he uses but also on the value of the
outcome $x$, a random variable. A standard procedure is to examine the
expected or average loss, called the risk, and given by

(1.1)    $R_{d(x)}(\theta) = E_\theta\{W(d(X), \theta)\} = \int_{\mathcal{X}} W(d(x), \theta) \, dP_\theta(x),$

or in the randomized case

(1.2)    $R_{m_x}(\theta) = \int_{\mathcal{X}} \int_{\mathcal{D}} W(d, \theta) \, dm_x(d) \, dP_\theta(x).$


Figure 2. The risk functions for two decision functions $d(x)$, $d'(x)$.


By relating probabilities to frequency ratios this represents the average loss
in a series of repetitions of the experiment. The subscript on $R_d(\theta)$
denotes dependence on the function $d(x)$ and not on particular values
of the function. In some situations it may be quite unrealistic to use
expected loss to judge a decision function. A very high loss with
a small probability of occurrence may have higher or lower real value
to the statistician than the expectation. Nevertheless much of the
theory is based on average loss or risk.

The statistician has a class of decision functions available, either $\mathcal{D}_0$ or
$\mathcal{M}_0$, and for each decision function has a risk function $R_d(\theta)$ which
describes its behavior for the different values of $\theta$. Ideally he would like
a decision function that gave a minimum value to $R_d(\theta)$ for each $\theta$, a
minimum-risk decision function: that is, in terms of Fig. 2, a decision
function having risk function beneath all other risk functions. However,
it only occurs in a few special problems that the $d(x)$ for which $R_d(\theta)$ is
minimized for one $\theta$ is the $d(x)$ producing a minimum for other values of $\theta$.
The theory of games provides us with a partial answer as to how to choose
a decision function in other than these exceptional circumstances.




The situation of the statistician faced with the choice of a decision
function can be viewed as a two-person game. The statistician is faced
with an unknown situation described by $\theta \in \Omega$. For the purposes of the
model it is convenient to think of this as having resulted from a first player,
whom we shall call Nature, making a move $\theta$ from a class of moves $\Omega$. The
statistician as the second player can make a move, the choice of a decision
function $d(x) \in \mathcal{D}_0$. The "pay-off" is the economic loss $R_d(\theta)$ to the
statistician.

The solution proposed in the theory of games is based on the following
type of argument. For a choice of decision function $d(x)$, the loss will be
$R_d(\theta)$ in situation $\theta$. The maximum loss that can occur using $d(x)$ is
$\max_\theta R_d(\theta)$ [or $\sup_\theta R_d(\theta)$]. The statistician chooses a minimax decision function
if his choice, say $d^*(x)$, minimizes $\max_\theta R_d(\theta)$:

(1.3)    $\max_\theta R_{d^*(x)}(\theta) \le \max_\theta R_{d(x)}(\theta)$

for any other decision function $d(x)$. Thus the statistician for each
decision function observes what is the worst average loss he can suffer
using that decision function, and then he chooses the decision function for
which this maximum risk is least. Such a choice of decision function can
be said to give protection against the most adverse situation that might
arise.
Sometimes on the basis of the risk function we are able to state that one
decision function $d_1(x)$ is to be preferred over another $d_2(x)$. If

$R_{d_1}(\theta) \le R_{d_2}(\theta)$

for all $\theta$ in $\Omega$ and

$R_{d_1}(\theta) < R_{d_2}(\theta)$

for at least one $\theta$ in $\Omega$, then we say $d_1(x)$ is better than $d_2(x)$.
So long as we accept the principle of judging a decision function by means
of average loss or risk, then, if $d_1(x)$ is better than $d_2(x)$, we would always
choose $d_1$ in preference to $d_2$. The risk is never greater and in some
situations is actually less. See Fig. 3.

A decision function is said to be admissible if there is no decision function
that is better. Thus $d(x)$ is admissible if there does not exist a decision
function $d'(x)$ for which

$R_{d'}(\theta) \le R_d(\theta)$

for all $\theta$ with strict inequality for at least one $\theta$. A decision function $d(x)$
is inadmissible if there exists a better decision function.

A class $C$ of decision functions is said to be complete if for every decision
function not in the class $C$ there is a decision function in $C$ that is better. If
in a problem we can find a complete class, obviously we can restrict our
attention to the decision functions in that class, because for any decision
function outside the class we can always do better within.

It is easily seen that an admissible decision function must be contained in 
any complete class. For, if d(x) is admissible and is outside the complete 
class, then the “complete class” definition says there is a better decision 
function in the class whereas the “admissible decision function” definition 
says there just is not a better decision function. This is a contradiction. 
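
When the parameter space is reduced to a finite set of situations, the relations "better than" and "admissible" can be checked mechanically. The following is a minimal sketch in Python under that assumption; the three decision functions and their risk values are hypothetical, chosen only for illustration, and admissibility is judged relative to the listed functions alone rather than to the full class of decision functions.

```python
from fractions import Fraction

def is_better(risk_a, risk_b):
    """True if risk_a <= risk_b at every parameter point, strictly less at one."""
    never_greater = all(a <= b for a, b in zip(risk_a, risk_b))
    somewhere_less = any(a < b for a, b in zip(risk_a, risk_b))
    return never_greater and somewhere_less

def admissible(risks):
    """Names of the listed decision functions that no other listed one beats."""
    return [
        name
        for name, r in risks.items()
        if not any(is_better(other, r)
                   for other_name, other in risks.items()
                   if other_name != name)
    ]

# Hypothetical risk values at three parameter points (illustration only).
risks = {
    "d_a": (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)),
    "d_b": (Fraction(1, 4), Fraction(3, 4), Fraction(1, 2)),  # never below d_a
    "d_c": (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)),
}

print(admissible(risks))
```

In this illustration the functions that survive form a complete class within the listed set, since the one function left out has a better function among them.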


Figure 3. $d_1(x)$ is better than $d_2(x)$.


EXAMPLE 1.1. Suppose a statistician is put in a very difficult position.
He is given a weighted coin with the information that the probability $p$ for
a head is either 1/4 or 3/4, and he must decide on the basis of two tosses of
the coin the value of the probability. The penalty for an incorrect
decision is severe, say death. We have

$\mathcal{X} = \{2H, 1H, 0H\},$

where 1H for example indicates the result of exactly one head in the two
tosses. The probability measure $P_p(x)$ is given by

$\{p^2, 2pq, q^2\},$

where the three values in the brace give the probability measure of the
corresponding points in the previous brace ($q = 1 - p$). Also

$\Omega = \{1/4, 3/4\}.$

The decisions available are $p = 1/4$ and $p = 3/4$. We designate these
decisions by 1/4 and 3/4, respectively;

$\mathcal{D} = \{1/4, 3/4\}.$

We shall consider two nonrandomized decision functions, say $d_1(x)$ and
$d_2(x)$, and a class of randomized decision functions $\{m_x^\alpha \mid \alpha \in \,]0, 1[\}$. These
are given in Tables 1 and 2.


Table 1. Decisions using $d_1(x)$ and $d_2(x)$

Decision
Function        x = 2H      1H        0H
$d_1(x)$          3/4       1/4       1/4
$d_2(x)$          3/4       3/4       1/4


Table 2. Probabilities for the randomized decision function $m_x^\alpha$

Randomized
Decision
Function        x = 2H      1H        0H
$m_x^\alpha(3/4)$      1         α         0
$m_x^\alpha(1/4)$      0       1 − α       1


It is easily seen that $m_x^\alpha$ is a compromise
between the two nonrandomized decision functions in the sense that it
can be described as choosing $d_2(x)$ with probability $\alpha$ and $d_1(x)$ with
probability $1 - \alpha$. For example $m_{1H}^\alpha(3/4) = \alpha$ means that, when the
outcome is 1H, the statistician chooses the decision 3/4 with probability $\alpha$
and of course the remaining decision 1/4 with probability $1 - \alpha$.
For each of these decision functions we could quite easily calculate the
probabilities for each decision when $p = 1/4$ and for each decision when
$p = 3/4$. Such a set of probabilities would be called the operating
characteristic of the decision function. However, we shall introduce a loss
function and directly calculate the risk. Consider the loss function

$W(d, p) = 1$  if $d \ne p$
$\qquad\quad\;\, = 0$  if $d = p$.

The loss is one unit, in this case one statistician, for an incorrect decision
and no loss for a correct decision.

By using the probabilities, the loss function, and the table of decision
functions, we can straightforwardly calculate the risk. For instance:

$R_{d_1}(1/4) = 1 \cdot (1/4)^2 + 0 \cdot 2(1/4)(3/4) + 0 \cdot (3/4)^2 = 1/16,$

$R_{m^\alpha}(1/4) = [1 \cdot 1 + 0 \cdot 0](1/4)^2 + [1 \cdot \alpha + 0 \cdot (1 - \alpha)] \cdot 2(1/4)(3/4) + [1 \cdot 0 + 0 \cdot 1](3/4)^2 = (1 + 6\alpha)/16.$

Making a table of the risk for $d_1(x)$, $d_2(x)$, $m_x^\alpha$ and the particular case $m_x^{1/2}$,
we obtain Table 3.


Table 3

                  R(1/4)          R(3/4)          max R(p)
$d_1$               1/16            7/16            7/16
$d_2$               7/16            1/16            7/16
$m_x^\alpha$          (1 + 6α)/16     (7 − 6α)/16
$m_x^{1/2}$           4/16            4/16            4/16


With either of the nonrandomized decision functions
the risk can be as high as 7/16; that is, a 7/16 chance of losing the statistician.
However, using the randomized "strategy" with $\alpha = 1/2$, the
randomized decision function reduced the maximum risk from 7/16 to 4/16.
Therefore we say that $m_x^{1/2}$ is a minimax decision function;
such a decision function gives protection against adverse situations.
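
The entries of Table 3 can be recomputed with exact fractions. The sketch below is one way to do so in Python; the function names and the grid of values tried for $\alpha$ are assumptions made for illustration (the grid endpoints 0 and 1 reproduce $d_1$ and $d_2$), and each decision function is encoded by the probability with which it announces the decision 3/4 at each outcome.

```python
from fractions import Fraction

QUARTER, THREE_QUARTERS = Fraction(1, 4), Fraction(3, 4)
OUTCOMES = ("2H", "1H", "0H")

def outcome_probs(p):
    """Probabilities {p^2, 2pq, q^2} of the outcomes 2H, 1H, 0H."""
    q = 1 - p
    return {"2H": p * p, "1H": 2 * p * q, "0H": q * q}

def randomized_rule(alpha):
    """Probability of announcing the decision 3/4 at each outcome (Table 2)."""
    return {"2H": Fraction(1), "1H": Fraction(alpha), "0H": Fraction(0)}

D1 = randomized_rule(0)  # Table 1: announces 3/4 only after 2H
D2 = randomized_rule(1)  # Table 1: announces 3/4 after 2H or 1H

def risk(rule, p):
    """Expected 0-1 loss: probability of announcing the wrong value of p."""
    probs = outcome_probs(p)
    total = Fraction(0)
    for x in OUTCOMES:
        wrong = rule[x] if p == QUARTER else 1 - rule[x]
        total += wrong * probs[x]
    return total

def max_risk(rule):
    return max(risk(rule, QUARTER), risk(rule, THREE_QUARTERS))

# Hypothetical search grid for alpha (not part of the original example).
ALPHAS = [Fraction(k, 12) for k in range(13)]
best_alpha = min(ALPHAS, key=lambda a: max_risk(randomized_rule(a)))

print(max_risk(D1), max_risk(D2), best_alpha,
      max_risk(randomized_rule(best_alpha)))
```

On this grid the maximum risk is smallest where the two risks $(1 + 6\alpha)/16$ and $(7 - 6\alpha)/16$ are equal, which is the value of $\alpha$ used in the last row of Table 3.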
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EXAMPLE 1.2. We consider a simple problem of sampling from a normal 
distribution. Let $\mathcal{X} = R^9$ and an outcome be a sequence of nine real
numbers. Also let the probability distributions be those corresponding to
sampling from a normal distribution with mean $\mu$ and variance 1. The
parameter space is $\Omega = \{\mu\} = R^1$. We have

(1.4)    $P_\mu(A) = \int_A (2\pi)^{-9/2} \exp\left[-\tfrac{1}{2} \sum_{i=1}^{9} (x_i - \mu)^2\right] \prod_{i=1}^{9} dx_i.$

Suppose the statistician is interested in only two decisions: the decision
$\mu \le 0$ designated by $d_1$ and the decision $\mu > 0$ designated by $d_2$.

Very often in problems of this sort $d_1$ represents the sort of standard
situation present in similar problems in the past (the status quo), whereas
$d_2$ represents a change from the past which might exist in the present
problem. A familiar practice among statisticians is to restrict the class of
decision functions to those that give a certain level of protection in the
standard situation. The usual restriction is that the decision function
$d(x)$ must have

$\Pr_\mu\{d(X) = d_2\} \le \alpha$

when $\mu \le 0$. Then the probability of an incorrect decision is bounded by
$\alpha$ (say 0.05) if the standard situation exists in the problem.
A decision function satisfying this restriction is

(1.5)    $d^*(x) = d_2$  if $\sum x_i / 9 > 1.64/3$
         $\qquad\quad = d_1$  if $\sum x_i / 9 \le 1.64/3$.

We can easily check to see if the restriction is satisfied. The induced
distribution of the statistic $\sum x_i / 9$ is normal with mean $\mu$ and variance 1/9.
Therefore

$\Pr_\mu\{d^*(X) = d_2\} = \Pr_\mu\left\{\dfrac{\sum X_i}{9} > \dfrac{1.64}{3}\right\}$

$\qquad = \Pr_\mu\left\{\dfrac{\sum X_i / 9 - \mu}{1/3} > 1.64 - 3\mu\right\}$

$\qquad = \Pr\{Z > 1.64 - 3\mu\},$

where $Z$ stands for a random variable with the normal distribution, mean
zero, and variance 1. From tables of the normal distribution, the above
probability is 0.05 when $\mu = 0$ and is less than 0.05 when $\mu < 0$.

In problems such as this the operating characteristic of a decision
function has a particularly simple form. Since there are only two decisions,
and a decision must be made at the completion of the experiment, it
suffices to give the probability for one of them which by tradition is $d_2$.


Figure 4. The power function for the decision function given by (1.5).


This probability of a decision $d_2$ is given a special name, the power function,
and is designated by $P_d(\mu)$:

$P_d(\mu) = \Pr_\mu\{d(X) = d_2\}.$

For the decision function defined above,

(1.6)    $P_{d^*}(\mu) = \Pr_\mu\left\{\dfrac{\sum X_i}{9} > \dfrac{1.64}{3}\right\} = \Pr\{Z > 1.64 - 3\mu\}.$

This is plotted in Fig. 4. The restriction on the decision functions is that
the power $P_d(\mu)$ should be less than 0.05 when $\mu \le 0$; that is, the power
function curve should remain beneath the sketched line segment at height
0.05.


Consider a simple loss function such as

$W(d_2, \mu) = 1$  if $\mu \le 0$
$\qquad\qquad = 0$  if $\mu > 0$,
$W(d_1, \mu) = 0$  if $\mu \le 0$
$\qquad\qquad = a$  if $\mu > 0$.

The loss for an incorrect decision is 1 when $\mu \le 0$ and is $a$ when $\mu > 0$.


Figure 5. The risk function for $d^*(x)$ when $a = 1/2$.


The risk can be given simply in terms of the power:

$R_d(\mu) = P_d(\mu)$  if $\mu \le 0$
$\qquad\quad = a[1 - P_d(\mu)]$  if $\mu > 0$.

For all values of $\mu \le 0$ the restriction has given the protection that the
probability of a wrong decision is kept below 0.05. Therefore it seems
reasonable to examine the risk function only for values of $\mu > 0$. If we
do this, then it can be shown, and will be in Section 3, that $d^*(x)$ is uniformly
better than any other decision function satisfying the restriction.
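
The power function (1.6) and the risk just given are easily evaluated numerically. The following Python sketch does so with the complementary error function from the standard library; the function names and the default arguments (nine observations, cutoff 1.64, $a = 1/2$) are assumptions taken from this example rather than a general procedure. With the rounded cutoff 1.64 the computed power at $\mu = 0$ comes out close to 0.0505 rather than exactly 0.05.

```python
import math

def std_normal_upper_tail(z):
    """Pr{Z > z} for a standard normal random variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def power(mu, n=9, cutoff=1.64):
    """Pr_mu{sample mean > cutoff / sqrt(n)} = Pr{Z > cutoff - sqrt(n) * mu}."""
    return std_normal_upper_tail(cutoff - math.sqrt(n) * mu)

def risk(mu, a=0.5):
    """Risk of d*: the power when mu <= 0, and a * (1 - power) when mu > 0."""
    return power(mu) if mu <= 0 else a * (1.0 - power(mu))

# Tabulate the two curves sketched in Figures 4 and 5.
for mu in (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"mu = {mu:+.1f}  power = {power(mu):.4f}  risk = {risk(mu):.4f}")
```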


> To complete this section on decision theory we give a theorem which says
that in any problem the statistician need consider only those decision functions
that are based on a sufficient statistic.

THEOREM 1.1. If $t(x)$ is a sufficient statistic for $\{P_\theta \mid \theta \in \Omega\}$, then, for any
decision function $d(x)$, there is a randomized decision function based on $t(x)$
which has the same risk function (assuming that the conditional probability is
a measure).

Proof. We define a randomized decision function based on the statistic $t(x)$.
Consider the conditional probability measure $P(A \mid t)$ over $\mathcal{X}$ and the statistic
$d(x)$; take the randomized decision function $m_t(D)$ to be the probability
measure over $\mathcal{D}$ induced from $P(A \mid t)$ by the statistic $d(x)$. Then we have

$R_{m_t}(\theta) = \int_{\mathcal{T}} \int_{\mathcal{D}} W(d, \theta) \, dm_t(d) \, dP_\theta^T(t)$

$\qquad = \int_{\mathcal{T}} \int_{\mathcal{X}} W(d(x), \theta) \, dP(x \mid t) \, dP_\theta^T(t)$

$\qquad = \int_{\mathcal{X}} W(d(x), \theta) \, dP_\theta(x)$

$\qquad = R_{d(x)}(\theta).$

For further reading on decision theory see Wald [1], and on the two-person
game see Von Neumann and Morgenstern [2]. <
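
The construction in the proof can be followed through in a small discrete case. The Python sketch below takes two tosses of a coin recorded in order, a hypothetical decision function `d` that looks only at the first toss, and the sufficient statistic $t$ = number of heads; the rule `d`, the 0-1 loss, and all names are assumptions made for illustration. Given $t$, the arrangements of heads and tails are equally likely whatever $p$ is, and the randomized rule `m` is the distribution over decisions that `d` induces from that conditional distribution.

```python
from fractions import Fraction
from itertools import product

N_TOSSES = 2
OUTCOMES = list(product((1, 0), repeat=N_TOSSES))  # 1 = head, 0 = tail

def outcome_prob(x, p):
    heads = sum(x)
    return p ** heads * (1 - p) ** (N_TOSSES - heads)

def d(x):
    """Hypothetical rule that depends on the order of the tosses."""
    return Fraction(3, 4) if x[0] == 1 else Fraction(1, 4)

def loss(decision, p):
    return 0 if decision == p else 1

def risk_of_d(p):
    return sum(loss(d(x), p) * outcome_prob(x, p) for x in OUTCOMES)

def m(t):
    """Distribution over decisions induced by d(x) from the conditional
    distribution of x given t heads (uniform over the arrangements)."""
    arrangements = [x for x in OUTCOMES if sum(x) == t]
    weights = {}
    for x in arrangements:
        share = Fraction(1, len(arrangements))
        weights[d(x)] = weights.get(d(x), Fraction(0)) + share
    return weights

def risk_of_m(p):
    total = Fraction(0)
    for t in range(N_TOSSES + 1):
        prob_t = sum(outcome_prob(x, p) for x in OUTCOMES if sum(x) == t)
        total += prob_t * sum(w * loss(dec, p) for dec, w in m(t).items())
    return total

for p in (Fraction(1, 4), Fraction(3, 4)):
    print(p, risk_of_d(p), risk_of_m(p))
```

For this particular `d` the induced rule chooses each decision with probability 1/2 when exactly one head is observed, and the two risk functions agree at both values of $p$.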


2. THE ESTIMATION OF REAL PARAMETERS 


2.1. Introduction. In Chapter 1 we used the term parameter for a 
symbol which indexed a class of probability measures. It has, however, a 
more general interpretation as a quantity calculated from a probability 
measure and therefore characteristic of the measure. We define a para- 
meter to be a function g*(P,) defined over a class of probability measures 
or equivalently to be a function 8(9) defined over the parameter space Q. A 
real parameter g(0) is a real-valued function over Q. A vector parameter 
g(0) = (g,(0), «+, &A0)) is a vector-valued function over Q, taking its 


values in a real space of k dimensions, R®. Thus a vector parameter has 
coordinates which are real parameters. 


For the class of normal distributions on 
variance o®, we have 0 = (u, o?) which is an element of Q = R x JO, oof. 
As examples of real parameters consider H, œ, E,{X?}. As examples of 
vector parameters consider (u, o), [Enct{X}, Eyce{X?}, E,,,¢{X}], where 
the latter can be more simply written [u, u? + 2, 18 + 30]. 

We confine the theory of estimation to a consideration of real or vector 
parameters. Now for applications the statistician wants a statistic that 
will produce from the outcome a real- or vector-valued quantity as an 
“estimate” of the parameter being considered. He wishes the value of the 
Statistic to be close to the value of the parameter or as close as is possible 
considering the randomness in the experiment. These ideas we now make 


Precise by imposing restrictions on the general model introduced in the 
previous section. 


Let the parameter of interest be g0) = (g,(0), - - 
Statistician obtains an outcome from an experim 


the real line with mean jz and 


5 g,(0)). If the 
ent characterized by 6, then 
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he wishes to calculate from the outcome a possible value for the parameter 
g and he hopes that this “estimate” is close to the true value g(9). 
Accordingly the space of decisions must be a real space of k dimensions, 
@G= R*. Also a decision function must have the form, d(x) = (d(x), 
+++, d,(x)), where d,(x), +++, d,(«) are real statistics. The loss function is 
then used to formalize the requirement that the estimate be close to the 
parameter. We require that 


2.1) W(a, 0) < Wd + Xd — g0)),0) 


where d is a vector and by d—g we mean the vector difference, 
(d, =p d — g,). The above inequality means that the loss for the 
decision d must be less than or equal the loss for the decision d + e? (d — g) 
which is a proportion e? farther away from the parameter value g. We 
shall also use the term estimator for a decision function in an estimation 
problem. 

From the decision theory it follows then that a randomized estimator
must be a statistic taking its values in the space of probability measures
over $R^k$. As yet, randomized estimators have found little application in
the standard problems, and we shall have a theorem later in the section
which supports this result. As a practice, however, we should be prepared
to examine the larger class consisting of randomized estimators in the hope
of perhaps finding a better estimator there. Theorem 1.1 in the preceding
section says that, if we restrict attention to estimators based on a sufficient
statistic, then in general we must examine the randomized estimators if we
wish to cover all the possibilities of the nonrandomized estimators based
on the original outcome. See Problem 3.

The general theory suggests that, in looking for an estimator $d(x)$, we
try to find one for which the risk

(2.2) $R_d(\theta) = \int_{\mathcal{X}} W(d(x), \theta) \, dP_\theta(x)$

is as small as possible simultaneously for each $\theta$. We give a simple
example to indicate that this only occurs in the most trivial problems.

EXAMPLE 2.1. Let the parameter $g(\theta)$ be real-valued, and assume that
the loss function $W(d, \theta)$ has a single minimum value for each $\theta$, say for
example $W(d, \theta) = (d - g(\theta))^2$. Then the trivial estimator, $d(x) = d_0$,
which ignores the value of the outcome, has minimum risk for all the $\theta$'s
having $g(\theta) = d_0$:

$$R_{d_0}(\theta) = \int_{\mathcal{X}} (d_0 - g(\theta))^2 \, dP_\theta(x) = (d_0 - g(\theta))^2.$$

For the $\theta$'s for which $g(\theta) = d_0$, we have $R_{d_0}(\theta) = 0$; and for other $\theta$'s
the risk is greater than zero. A short analysis will also show that this is
the only estimator having minimum risk at the $\theta$'s with $g(\theta) = d_0$. Thus
we can minimize the risk for certain $\theta$'s but not for all simultaneously. An
estimator having uniformly minimum risk does not exist.
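
The point of Example 2.1 can be seen numerically. The sketch below is only an illustration under assumptions not made in the text: it takes $g(\theta) = \theta$ to be the mean of a unit-variance distribution, so that the sample mean of $n$ observations has risk $1/n$ for every $\theta$. The function names are hypothetical.

```python
def risk_constant(d0, theta):
    """Squared-error risk of the trivial estimator d(x) = d0 when g(theta) = theta."""
    return (d0 - theta) ** 2


def risk_sample_mean(n, theta):
    """Squared-error risk of the mean of n unit-variance observations (assumed model).

    The sample mean is unbiased with variance 1/n, so its risk does not depend on theta.
    """
    return 1.0 / n


d0, n = 0.0, 4
for theta in (-1.0, 0.0, 0.25, 2.0):
    # Print theta, the risk of the constant estimator, and the risk of the sample mean.
    print(theta, risk_constant(d0, theta), risk_sample_mean(n, theta))
```

At $\theta = d_0$ the trivial estimator cannot be beaten, while away from $d_0$ the sample mean has the smaller risk, so neither estimator is uniformly better.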

We are thus faced with the quite general nonexistence of uniformly 
minimum risk estimators, at least within the full class of estimators. 

The minimax principle offers an alternative criterion on which to base
our choice of estimator. The desirability of an estimator is measured by
the maximum risk it encounters: $\sup_{\theta} R_d(\theta)$. A minimax estimator is
then one for which $\sup_{\theta} R_d(\theta)$ has its minimum value. With the simple
loss functions such as squared error, $W(d, \theta) = (d - g(\theta))^2$, used in the
standard problems, it can happen that the maximum risk is $+\infty$ for all
estimates. Then any estimate is minimax. However, simple modifications
in the loss function often produce better-behaved risk functions, and
consequently make the application of the minimax criterion more sensible.
The minimax criterion has been successfully applied to some of the simple
parametric problems involving normal distributions. For an interesting
paper on this field of estimation, see Hodges and Lehmann [3]. However,
it has not had extensive application in parametric problems, and there are
few indications of extensive application in the nonparametric field.
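
As a hedged illustration of the minimax criterion, the sketch below compares the maximum risk of two estimators of a binomial proportion under squared error. The binomial setting is a standard textbook illustration chosen here for convenience and is not taken from the text; the supremum is approximated by a maximum over a finite grid, and all function names are hypothetical.

```python
from math import comb, sqrt


def risk(estimator, n, p):
    """Exact squared-error risk E_p[(estimator(X) - p)^2] for X ~ Binomial(n, p)."""
    return sum(
        comb(n, x) * p**x * (1 - p) ** (n - x) * (estimator(x) - p) ** 2
        for x in range(n + 1)
    )


def max_risk(estimator, n, grid):
    """Largest risk over a finite grid of parameter values (approximates the sup)."""
    return max(risk(estimator, n, p) for p in grid)


n = 16
grid = [i / 100 for i in range(101)]


def proportion(x):
    """The usual unbiased estimator X/n."""
    return x / n


def shrunk(x):
    """Estimator pulled towards 1/2, chosen so that its risk is constant in p."""
    return (x + sqrt(n) / 2) / (n + sqrt(n))


print(max_risk(proportion, n, grid))  # close to 1/(4n)
print(max_risk(shrunk, n, grid))      # close to n / (4 (n + sqrt(n))^2)
```

The unbiased estimator has the smaller risk near $p = 0$ and $p = 1$, but the shrunk estimator has the smaller maximum risk, which is what the minimax criterion rewards.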

Another approach is to place moderate and reasonable restrictions on
the estimators under consideration, and thus reduce the class of estimators,
in the hope of finding a minimum risk estimator in the smaller class.
One such restriction is to require that an estimator be unbiased.

An estimator $d(x)$ is an unbiased estimator of $g(\theta)$ if

(2.3) $E_\theta\{d(X)\} = (E_\theta\{d_1(X)\}, \ldots, E_\theta\{d_k(X)\}) = (g_1(\theta), \ldots, g_k(\theta)) = g(\theta)$

for all $\theta \in \Omega$.

It is to be noted that the expectation of a vector is obtained by taking the
expectation of each coordinate.

Unbiasedness has an experimental interpretation which supports the
statement that it is a reasonable requirement to place on an estimator. In
a long series of repetitions of an experiment, the average of the different
values assumed by $d(x)$ will be close to the value of $g(\theta)$ with high
probability. This follows from Khintchine's theorem to be given in Chapter
6. In the present notation the theorem says that, if $E_\theta\{d(X)\}$ exists as is
implicitly assumed in formula (2.3), then the probability that the average
is in a neighborhood of $g(\theta)$ can be made arbitrarily close to one
by taking a large enough number of repetitions. A stronger version of
the theorem states that, with probability one, the sequence of
averages obtained by taking more and more repetitions will converge to
$E_\theta\{d(X)\} = g(\theta)$.

There is another definition of unbiasedness which from some points of
view is as reasonable as or more reasonable than that above. However it
applies only to real estimators.

An estimator $d(x)$ is a median unbiased estimator of $g(\theta)$ if

$$\xi_{0.5}^{\theta}(d(X)) = g(\theta)$$

where $\xi_{0.5}^{\theta}(d(X))$ is the median value of the distribution of $d(X)$ obtained
from the measure $P_\theta$.

The symbol $\xi_p$ can also be defined for all values of $p$ between zero and one.
Freely speaking, it is the point in the distribution of a real-valued
random variable that has probability $p$ to the left of it and $1 - p$ to the
right of it. Precisely, $\xi_p(Y)$ for a real-valued random variable $Y$ with
distribution function $F(x)$ is the real number satisfying

(2.4) $\int_{-\infty}^{\xi_p} dF(x) = F(\xi_p) = p$.

If the solution to the equation is not unique, $\xi_p$ is taken to be any one of the
possible values; in any of our applications the ambiguity will be
unimportant. If the equation does not have a solution, $\xi_p$ is chosen to satisfy

$$F(\xi_p - 0) \le p \le F(\xi_p).$$
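
The definition of $\xi_p$ can be sketched for a discrete distribution, where equation (2.4) often has no solution and the bracketing condition $F(\xi_p - 0) \le p \le F(\xi_p)$ is the one that applies. The function name `quantile` and the example distributions are illustrative assumptions, not part of the text.

```python
from fractions import Fraction


def quantile(points, probs, p):
    """Return a value xi_p for a discrete distribution on sorted `points`.

    Picks the smallest point x with F(x) >= p, which satisfies
    F(x - 0) <= p <= F(x); when F(x) = p exactly, larger values would also qualify.
    """
    cumulative = Fraction(0)
    for x, w in zip(points, probs):
        cumulative += w
        if cumulative >= p:
            return x
    raise ValueError("probabilities must sum to at least p")


third = Fraction(1, 3)
median = quantile([1, 2, 3], [third, third, third], Fraction(1, 2))
print(median)  # 2, since F(2 - 0) = 1/3 <= 1/2 <= F(2) = 2/3
```

Exact rational arithmetic is used so that the comparison with $p$ is not disturbed by rounding.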
Median unbiasedness has found little application in estimation theory,
because it does not seem to lend itself to the mathematical
analysis needed to find minimum risk estimates.

In Section 2.3 we shall consider briefly another reasonable restriction to
place on the class of estimators. It is the restriction to estimators
satisfying a property of invariance.

2.2. Unbiased Estimation. In this section we consider some theory and
methods concerned with estimators that satisfy the restriction of
unbiasedness as given by formula (2.3).

First we give some examples of loss functions which are frequently used
with the property of unbiasedness. If $g(\theta)$ is a real parameter, consider

(2.5) $W_1(d, \theta) = [d - g(\theta)]^2$
      $W_2(d, \theta) = |d - g(\theta)|^p$ if $p \ge 1$
      $W_3(d, \theta) = W_\theta(d)$

where $W_\theta(d)$ is a convex function of $d$ for each value of $\theta$. For this we
define:

$W(d)$ is a convex function of $d$ over $R^k$ if, for each $d$, $d'$, and $\alpha \in\, ]0, 1[$,

(2.6) $\alpha W(d) + (1 - \alpha) W(d') \ge W(\alpha d + (1 - \alpha) d')$.

The function is strictly convex if the inequality is always strict for $d \ne d'$.
For an illustration of the definition see Fig. 6. It is easily seen that
$W_1(d, \theta)$ and $W_2(d, \theta)$ are convex functions of $d$ (strictly convex for $W_1$,
and for $W_2$ when $p > 1$) and hence that they are particular cases of
$W_3(d, \theta)$.


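[Figure 6. Illustration of a convex function: at the intermediate point
$\alpha d + (1 - \alpha) d'$ the chord value $\alpha W(d) + (1 - \alpha) W(d')$ lies above the
curve value $W(\alpha d + (1 - \alpha) d')$.]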


For unbiased estimators the risk functions corresponding to $W_1$ and $W_2$
have a familiar form. With $W_1$ we obtain

$$R_d(\theta) = \int_{\mathcal{X}} [d(x) - g(\theta)]^2 \, dP_\theta(x)
 = \int_{\mathcal{X}} [d(x) - E_\theta\{d(X)\}]^2 \, dP_\theta(x)
 = \sigma^2_{d(X)}(\theta),$$

which is the variance of $d(X)$ corresponding to the probability distribution
$\theta$. With $W_2$, the risk function becomes the $p$th absolute moment of $d(X)$
about its mean.


For vector parameters we give the following two examples of loss
functions:

(2.7) $W_4(d, \theta) = \sum_{i=1}^{k} [d_i - g_i(\theta)]^2$
      $W_5(d, \theta) = W_\theta(d)$

where $W_\theta(d)$ is a convex function of the vector $d$. For the first of these loss
functions the risk function takes a particularly simple form. We have

$$R_d(\theta) = E_\theta \sum_{i=1}^{k} (d_i(X) - g_i(\theta))^2 = \sum_{i=1}^{k} \sigma^2_{d_i(X)}(\theta),$$

where $\sigma^2_{d_i(X)}(\theta)$ is the variance of the coordinate $d_i(X)$ when the distribution
is given by $\theta$.

There is a property not based directly on the idea of risk which can
sometimes be attained by unbiased vector estimators. It was introduced
by Cramér [4], p. 300 and p. 491. Let $\|\sigma_{ij}(\theta)\|$ be the covariance matrix for
the coordinates of $d(X) = (d_1(X), \ldots, d_k(X))$ when the probability measure
is given by $\theta$:

(2.8) $\sigma_{ij}(\theta) = E_\theta\{[d_i(X) - g_i(\theta)][d_j(X) - g_j(\theta)]\}$.

Also, assuming that the matrix $\|\sigma_{ij}(\theta)\|$ is nonsingular, we let $\|\sigma^{ij}(\theta)\|$ be
the inverse matrix. Then Cramér defined the ellipsoid of concentration
for $d(X)$ to be

(2.9) $\sum_{i,j=1}^{k} \sigma^{ij}(\theta) \, y_i y_j = k + 2$

where, for convenience, we have let $y_i = d_i - g_i(\theta)$. This ellipsoid has a
rather simple interpretation. For, if we consider the multivariate normal
distribution in $R^k$ having the same mean and covariance matrix as $d(X)$,
then the quadratic form on the left-hand side of equation (2.9) is, except for
a factor $-\tfrac{1}{2}$, the exponent in the multivariate probability density. The
ellipsoid given by the equation is then a surface of constant probability
density which in a certain sense outlines or displays the probability
distribution. Also, if we consider a probability distribution that is uniform
within the ellipsoid, then it will have the same means, variances, and covariances
as the given distribution; this is the justification for the constant $k + 2$ on
the right-hand side of (2.9). Following Cramér we say that an unbiased
estimator $d(x)$ has minimum concentration ellipsoid if (2.9) is contained in
the ellipsoid of concentration for any other unbiased estimator.
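
The role of the constant $k + 2$ can be checked by simulation. The following sketch, for $k = 2$ and an arbitrarily chosen covariance matrix (an assumption made only for illustration), draws points uniformly inside the ellipse $y' \|\sigma^{ij}\| y \le k + 2$ and estimates their covariances, which should be close to the $\sigma_{ij}$ used to define the ellipse. The helper names are hypothetical.

```python
import random
from math import sqrt


def sample_in_concentration_ellipse(s11, s12, s22, rng):
    """Draw a point uniformly from {y : y' S^{-1} y <= k + 2} for k = 2.

    S = [[s11, s12], [s12, s22]] is an assumed positive definite covariance matrix.
    """
    # Cholesky factor L of S (S = L L'), written out for the 2 x 2 case.
    l11 = sqrt(s11)
    l21 = s12 / l11
    l22 = sqrt(s22 - l21 * l21)
    # Rejection sampling: uniform point in the unit disc.
    while True:
        u, v = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if u * u + v * v <= 1.0:
            break
    radius = sqrt(2 + 2)  # sqrt(k + 2) with k = 2
    z1, z2 = radius * u, radius * v
    # y = L z maps the disc z'z <= k + 2 onto the ellipse y' S^{-1} y <= k + 2.
    return l11 * z1, l21 * z1 + l22 * z2


def sample_covariance(pairs):
    """Plain (divide-by-N) covariance estimates c11, c12, c22."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    c11 = sum((a - m1) ** 2 for a, _ in pairs) / n
    c12 = sum((a - m1) * (b - m2) for a, b in pairs) / n
    c22 = sum((b - m2) ** 2 for _, b in pairs) / n
    return c11, c12, c22


rng = random.Random(0)
draws = [sample_in_concentration_ellipse(2.0, 0.6, 1.0, rng) for _ in range(100_000)]
c11, c12, c22 = sample_covariance(draws)
print(c11, c12, c22)  # should be near 2.0, 0.6, 1.0
```

With any other constant on the right-hand side of (2.9) the estimated covariances would be a fixed multiple of the $\sigma_{ij}$ rather than equal to them.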

The remainder of the theory using risk functions will depend on the use
of convex loss functions. Convex loss functions are open to some criticism.
Frequently in application the statistician wants his estimate to be close to
the parameter, but, if it is far from the parameter, then a little farther does
not matter much. The convex loss does not exhibit this property. For
consider a single real parameter $g(\theta)$, and assume that the loss function
$W_\theta(d)$ is not asymptotic to the $d$ axis in either direction. Then it follows
easily from the convexity that, for any decision $d$ far from the value $g(\theta)$,
the additional loss in going to the value $d + \epsilon$ (or $d - \epsilon$) farther from
$g(\theta)$ will be at least as large (larger if strict convexity) as the additional
loss when $d$ is closer to $g(\theta)$.

Before commencing the theory itself for unbiased estimation, we develop 
a few results concerning convex functions and ellipsoids of concentration. 


[Figure 7. A strictly convex curve $W(y)$ and a line of support $L(y)$,
touching at the point $(y_0, W(y_0)) = (y_0, L(y_0))$.]


THEOREM 2.1. A convex function defined over the real line is a 
continuous function. 


Proof. This is given as Problem 4 at the end of the Chapter. 


THEOREM 2.2. Through any point on a convex curve defined over the 
real line there passes a straight line which stays beneath the curve or at 
most touches it. Such a line is called a line of support. If the curve is 


strictly convex, then the line of support is strictly beneath the curve except 
at the one point of contact. 


Proof. See Fig. 7. Let $(y_0, W(y_0))$ be an arbitrary point on a convex curve
$W(y)$. Also let $y_-$, $y_+$, $y_+^*$ be three points satisfying $y_- < y_0 < y_+ < y_+^*$.
We consider the slopes of the lines joining the points above $y_-$, $y_+$, $y_+^*$ to
the point $(y_0, W(y_0))$. It is easily seen that the line joining the points on
the curve above $y_+^*$ and $y_0$ has slope

$$\frac{W(y_+^*) - W(y_0)}{y_+^* - y_0}$$

which is greater than or equal to the slope

$$\frac{W(y_+) - W(y_0)}{y_+ - y_0}$$

for the line joining the points above $y_+$ and $y_0$. For, if not, then the slope
inequality could be rearranged and would violate the convexity property
applied to the points $y_0$, $y_+^*$ with intermediate point $y_+$. Similarly, the
slope of the line for $y_+$ and $y_0$ is greater than or equal to the slope of the line
for $y_-$ and $y_0$.

From the above argument it follows that the slope

$$\frac{W(y) - W(y_0)}{y - y_0}$$

is decreasing (or at least nonincreasing) as $y$ decreases towards $y_0$. Also
the values of the slope for $y > y_0$ are bounded below by

$$\frac{W(y_-) - W(y_0)}{y_- - y_0}.$$

Therefore the derivative to the right, say $D_R$, exists at $y_0$:

(2.10) $D_R = \lim_{y \downarrow y_0} \dfrac{W(y) - W(y_0)}{y - y_0}$,

where $y \downarrow y_0$ means that $y$ approaches $y_0$ from the right. Similarly the
derivative from the left exists,

(2.11) $D_L = \lim_{y \uparrow y_0} \dfrac{W(y) - W(y_0)}{y - y_0}$.

These two derivatives satisfy $D_L \le D_R$ because we have proved that each
element for the right-hand side of (2.10) is greater than or equal to each
element of the right-hand side of (2.11).

Take $D_0$ to be any number satisfying $D_L \le D_0 \le D_R$. We now prove
that the straight line $L(y) = D_0 (y - y_0) + W(y_0)$ is a line of support at
$(y_0, W(y_0))$. First, the line obviously passes through $(y_0, W(y_0))$. Second
we prove $W(y) \ge L(y)$. For any point $y_+$, the slope to the right is at least
its limit $D_R$, so the inequality $D_R \ge D_0$ gives

$$\frac{W(y_+) - W(y_0)}{y_+ - y_0} \ge D_0.$$

By rearrangement

$$W(y_+) \ge D_0 (y_+ - y_0) + W(y_0) = L(y_+).$$

Similarly the inequality holds for any point $y_-$ to the left of $y_0$.

The last statement in the theorem concerns strict convexity, and it
follows easily by checking a few of the details in the above argument.


By an extension of the above proof, we obtain 


THEOREM 2.3. Through any point $y_0$ on a convex function $W(y)$
defined over $R^k$, there passes a hyperplane $L(y) = \sum_{i=1}^{k} l_i y_i + l_0$ which
satisfies

(2.12) $W(y) \ge L(y)$.

If the function is strictly convex, then (2.12) is strict unless $y = y_0$.


[Figure 8. In illustration of Theorem 2.4: the line of support $L(x)$ meets the
curve $W(x)$ at $x = E(X)$, where $W(E(X)) = L(E(X))$.]


The next theorem proves as special cases a number of familiar inequalities
on expectations.

THEOREM 2.4. If $W(x)$ is a convex function and $X$ is a real random
variable, then

(2.13) $E\{W(X)\} \ge W(E\{X\})$;

and, if $W(x)$ is strictly convex, the inequality is strict,

(2.14) $E\{W(X)\} > W(E\{X\})$,

unless $X$ has all its probability at one point.

Proof. See Fig. 8. Let $L(x)$ be a line of support to $W(x)$ at
$[E\{X\}, W(E\{X\})]$. Then $W(x) \ge L(x)$ and $E\{W(X)\} \ge E\{L(X)\}$. But
the expectation of a linear function is the linear function of the expected
value; therefore

$$E\{W(X)\} \ge E\{L(X)\} = L(E\{X\}) = W(E\{X\}).$$

If $W(x)$ is strictly convex, then, from Theorem 2.2, $L(x) < W(x)$ unless
$x$ is the common point which in our case is $E\{X\}$. We have

$$E\{W(X) - L(X)\} > 0,$$

unless $W(X) = L(X)$ with probability one. The theorem follows by
noting that $W(x) = L(x)$ implies $x = E\{X\}$.

A familiar example of this theorem is the following inequality for a
real-valued nondegenerate random variable:

$$E\{X^2\} > (E\{X\})^2.$$

Also we have

$$E\{|X - EX|^r\} \ge |E\{X - EX\}|^r$$

if $r \ge 1$. For the theorem these two examples use, respectively, the
convex functions $x^2$ and $|x|^r$ with $r \ge 1$.
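
The first of these inequalities can be checked exactly on a small discrete distribution. The distribution below is an arbitrary illustrative choice and `expectation` is a hypothetical helper; exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction


def expectation(values, probs, func=lambda x: x):
    """E{func(X)} for a discrete random variable with the given support and weights."""
    return sum(w * func(x) for x, w in zip(values, probs))


values = [-1, 0, 3]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]

mean = expectation(values, probs)                            # E{X} = 1/2
second_moment = expectation(values, probs, lambda x: x * x)  # E{X^2} = 5/2
print(second_moment, mean ** 2)  # 5/2 and 1/4
```

For a distribution with all its probability at one point the two sides coincide, in agreement with the exception stated in Theorem 2.4.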
The next theorem will be of use when we consider estimators having
minimum concentration ellipsoids.

THEOREM 2.5. If $\|\sigma_{ij}\|$, $\|\sigma^*_{ij}\|$ are positive definite matrices and if
$\|\sigma^{ij}\|$, $\|\sigma_*^{ij}\|$ are the corresponding inverse matrices, then

(2.15) $\sum_{i,j=1}^{k} \sigma_{ij} \, l_i l_j \ge \sum_{i,j=1}^{k} \sigma^*_{ij} \, l_i l_j$

for all $l_1, \ldots, l_k$ implies

(2.16) $\sum_{i,j=1}^{k} \sigma^{ij} \, l_i l_j \le \sum_{i,j=1}^{k} \sigma_*^{ij} \, l_i l_j$

for all $l_1, \ldots, l_k$. Also identical equality in (2.15) implies the same in
(2.16).
Proof. From the theory of matrices we know there exists a nonsingular
matrix $A$ such that

$$A' \|\sigma^*_{ij}\| A = I$$

where $A'$ is the transpose of $A$ and $I$ is the identity matrix. Also for any
symmetric matrix there exists an orthogonal matrix which will diagonalize it.
Let $O$ diagonalize $A' \|\sigma_{ij}\| A$; then

$$O' A' \|\sigma_{ij}\| A O = D = \operatorname{diag}(d_1, \ldots, d_k).$$

Letting $AO = B$ and noting that $O' I O = I$, we obtain

(2.17) $B' \|\sigma^*_{ij}\| B = I$,
       $B' \|\sigma_{ij}\| B = D$.

If $l = (l_1, \ldots, l_k)'$, the assumed inequality in the theorem may be written

$$l' \|\sigma_{ij}\| l \ge l' \|\sigma^*_{ij}\| l.$$

Substituting $l = Bt$ and using (2.17), we obtain

$$t' D t \ge t' I t,$$

which may be rewritten

(2.18) $\sum d_i t_i^2 \ge \sum t_i^2$.

Since (2.18) holds for all $t_1, \ldots, t_k$ we have $d_i \ge 1$ $(i = 1, \ldots, k)$.
Therefore we can write, for all $t$,

$$\sum d_i^{-1} t_i^2 \le \sum t_i^2$$

or

$$t' D^{-1} t \le t' I t.$$

Now, if we make the substitution $t = B'l$ (which is not the inverse of
the one above!), we obtain

$$l' B D^{-1} B' l \le l' B B' l.$$

Then, since

$$\|\sigma^{ij}\| = [B'^{-1} D B^{-1}]^{-1} = B D^{-1} B',$$

and

$$\|\sigma_*^{ij}\| = [B'^{-1} I B^{-1}]^{-1} = B B',$$

we have the required inequality

$$l' \|\sigma^{ij}\| l \le l' \|\sigma_*^{ij}\| l$$

or in summation form

$$\sum \sigma^{ij} \, l_i l_j \le \sum \sigma_*^{ij} \, l_i l_j.$$

The last statement in the theorem is obtained by noting that equality is
equivalent to $\ge$ and $\le$ holding simultaneously and then applying the first
part of the theorem for each direction of inequality.
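
A small exact check of Theorem 2.5 for $k = 2$ may help fix the direction of the two inequalities. The two matrices below are arbitrary illustrative choices, and the helper functions are hypothetical. The quadratic-form inequality (2.15) is expressed as "the difference of the matrices is nonnegative definite".

```python
from fractions import Fraction


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return ((d / det, -b / det), (-c / det, a / det))


def difference(m1, m2):
    """Entrywise difference m1 - m2."""
    return tuple(tuple(x - y for x, y in zip(r1, r2)) for r1, r2 in zip(m1, m2))


def is_nonnegative_definite(m):
    """A symmetric 2 x 2 matrix is nonnegative definite iff trace >= 0 and det >= 0."""
    (a, b), (c, d) = m
    return a + d >= 0 and a * d - b * c >= 0


sigma = ((3, 1), (1, 2))        # plays the role of ||sigma_ij||
sigma_star = ((2, 1), (1, 1))   # plays the role of the second, smaller matrix

# (2.15): sigma - sigma_star is nonnegative definite.
hypothesis = is_nonnegative_definite(difference(sigma, sigma_star))
# (2.16): the order reverses for the inverse matrices.
conclusion = is_nonnegative_definite(
    difference(inverse_2x2(sigma_star), inverse_2x2(sigma))
)
print(hypothesis, conclusion)  # True True
```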

We now develop some theorems which provide a constructive procedure 
for improving estimators and under certain conditions a method for 
obtaining minimum-risk unbiased estimators. 


THEOREM 2.6. (RAO-BLACKWELL). If $t(x)$ is a sufficient statistic for
$\{P_\theta \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$ and if $f(x)$ is an unbiased estimator of $g(\theta)$, then
$h(t) = E\{f(X) \mid t\}$ is an unbiased estimator based on $t(x)$. The inequality
$\sigma_f^2(\theta) > \sigma_h^2(\theta)$ holds unless $f(x) = h(t(x))$ almost everywhere $(P_\theta)$. With a
strictly convex loss function the inequality $R_f(\theta) > R_h(\theta)$ holds unless
$f(x) = h(t(x))$ almost everywhere $(P_\theta)$, in which case $R_f(\theta) = R_h(\theta)$.


Proof. The part of the theorem concerning the variance inequality
follows as a particular case of the next part by using the strictly convex loss
function $W(d, \theta) = (d - g(\theta))^2$.

$t(x)$ is a sufficient statistic. Therefore the conditional probability is
independent of $\theta$, and $h(t)$, which is the conditional expectation of $f(x)$
given $t(x) = t$, does not depend on $\theta$. This proves that $h(t)$ is a statistic
defined over the space of $t(x)$, say $\mathcal{T}$. We now show that it, also, is an
unbiased estimator of $g(\theta)$.

$$\begin{aligned}
E_\theta\{h(T)\} &= \int_{\mathcal{T}} h(t) \, dP_\theta^T(t) \\
 &= \int_{\mathcal{T}} \int_{\mathcal{X}} f(x) \, dP(x \mid t) \, dP_\theta^T(t) \\
 &= \int_{\mathcal{X}} f(x) \, dP_\theta(x) \\
 &= E_\theta\{f(X)\}.
\end{aligned}$$

In short, $h(t)$ is the conditional expectation of $f(x)$, and therefore the
expectation of $h(t)$ is just the over-all expectation of $f(x)$. The above
equations assume that the conditional probability is a measure, but, if we
use directly the definition of conditional expectation, the result remains
valid in general.

We now consider the conditional expectation of the loss, given the sufficient
statistic $t(x)$. For $f(x)$ we have

$$E\{W(f(X), \theta) \mid t\},$$

and for $h(t)$ we have

$$W(h(t), \theta).$$

Now, applying Theorem 2.4 and remembering that $h(t) = E\{f(X) \mid t\}$, we
obtain

$$E\{W(f(X), \theta) \mid t\} \ge W(h(t), \theta)$$

with equality only if $f(X)$ is equal to $h(t)$ with conditional probability one.
Now, taking the marginal expectation with respect to the distribution for
$t$, we have

$$E_\theta\{W(f(X), \theta)\} \ge E_\theta\{W(h(T), \theta)\}$$

or equivalently

$$R_f(\theta) \ge R_h(\theta),$$

with equality only if $f(X) = h(t(X))$ with probability one. This completes
the proof under the assumption that the conditional probability is a
measure. However, by combining the proof of Theorem 2.4 with the sort
of argument above, the theorem can be proved without this assumption.


COROLLARY 1. If the loss function is convex, $R_f(\theta) \ge R_h(\theta)$ for all $\theta$.

Proof. The proof is a minor alteration of the second part of the proof
above.

COROLLARY 2. If the loss function is strictly convex, then the estimators
based on the sufficient statistic form a complete class of decision functions
for the estimation of $g(\theta)$.

Proof. The proof follows directly from the definition of complete class
and from the statement of the theorem, provided we call $f(x)$ a function of
$t(x)$ if it can be written $f(x) = h(t(x))$ almost everywhere $\{P_\theta \mid \theta \in \Omega\}$.

COROLLARY 3. If the loss function is convex, then for any randomized
unbiased estimator, there is a nonrandomized unbiased estimator with
smaller or equal risk.

Proof. The proof follows by replacing the random choice of estimate by
its conditional expectation, given the outcome, and following the pattern
of proof for the theorem.


EXAMPLE 2.2. Let $X_1, \ldots, X_n$ be independent and each have the same
absolutely continuous distribution on $R^1$, and let the class of probability
measures correspond to all the absolutely continuous distributions on $R^1$.
Then, freely speaking, we have a sample of $n$ from some absolutely
continuous distribution on $R^1$. Problems 29 and 30 in Chapter 1 are to
prove that $t(x) = (x_{(1)}, \ldots, x_{(n)})$, the order statistic, is sufficient. Consider
now the estimation of $E(X)$, the mean of the absolutely continuous
distribution. Obviously $f(x) = x_1$ is an unbiased estimator, but, since it
ignores $x_2, \ldots, x_n$, we should expect to be able to find a better estimator.
We apply the Rao-Blackwell theorem and calculate $h(t)$, the conditional
expectation of $x_1$. The conditional probability, given the order statistic,
assigns equal probability to each of the $n!$ permutations of $(x_{(1)}, \ldots, x_{(n)})$;
therefore

$$\Pr\{X_1 = x_{(i)} \mid t(x)\} = \frac{(n - 1)!}{n!} = \frac{1}{n}$$

and

(2.19) $h(t) = x_{(1)} \dfrac{1}{n} + \cdots + x_{(n)} \dfrac{1}{n} = \dfrac{\sum x_{(i)}}{n} = \dfrac{\sum x_i}{n} = \bar{x}$.

The Rao-Blackwell theorem then says that $\bar{x}$ is unbiased and that it has
smaller variance and smaller risk (strictly convex loss) than $x_1$.

Similarly for the estimation of $E(X^2)$ the Rao-Blackwell theorem says
that $\sum x_i^2 / n$ is unbiased and has smaller variance than the statistic $x_1^2$.

There is one detail we have overlooked. The parameters $E(X)$ and
$E(X^2)$ do not exist for all the probability measures of this example. This
frequently happens in nonparametric problems, and we shall consider it
further in Chapter 4. For our example here, the results obtained are
valid.
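
The conditional expectation in Example 2.2 can be computed by brute force for a small sample: average the estimator over all $n!$ orderings of the observed values, each with equal weight. The sketch below does this with exact fractions; the sample values and the function name `rao_blackwellize` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations


def rao_blackwellize(order_statistic, f):
    """Average f over all n! orderings of the sample, each given equal weight."""
    arrangements = list(permutations(order_statistic))
    return sum(Fraction(f(x)) for x in arrangements) / len(arrangements)


ordered = (1, 3, 4, 8)  # an illustrative order statistic with n = 4

h_mean = rao_blackwellize(ordered, lambda x: x[0])         # improves f(x) = x_1
h_square = rao_blackwellize(ordered, lambda x: x[0] ** 2)  # improves f(x) = x_1^2
print(h_mean, h_square)  # 4 and 45/2
```

The two results are the sample mean and the mean of the squared observations, as obtained in the example.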


EXAMPLE 2.3. Let $X_1, \ldots, X_n$ be independent and each have the
normal distribution with mean $\mu$ and variance one, and let the class of
probability measures correspond to all $\mu \in R^1$. Consider the estimation of
$E(X^2) = \mu^2 + 1$. By Example 5.1 in Chapter 1, we know that
$t(x) = n^{-1} \sum x_i = \bar{x}$ is a sufficient statistic. Also from the previous
example we know that $n^{-1} \sum x_i^2$ is an unbiased estimator for all absolutely
continuous distributions on $R^1$ and so certainly for the normal distributions
here. We apply the Rao-Blackwell theorem, noting that
$\sum x_i^2 = \sum (x_i - \bar{x})^2 + n \bar{x}^2$ and that the distribution of $\bar{X}$ is
independent of the distribution of $\sum (X_i - \bar{X})^2$:

$$E\Big\{\frac{\sum X_i^2}{n} \,\Big|\, \bar{x}\Big\}
 = \frac{E\{\sum (X_i - \bar{X})^2\}}{n} + \bar{x}^2
 = \frac{n - 1}{n} + \bar{x}^2.$$

This is then unbiased for the distributions of this example and has smaller
variance than $n^{-1} \sum x_i^2$.
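
A simulation can make the variance reduction in Example 2.3 visible. The sketch below is a rough illustration only: the values of $\mu$, $n$, the number of repetitions and the seed are arbitrary assumptions, and the function name `simulate` is hypothetical.

```python
import random
from statistics import fmean, pvariance


def simulate(mu, n, reps, seed=0):
    """Simulate both estimators of mu^2 + 1 for normal samples with variance one."""
    rng = random.Random(seed)
    raw, improved = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        raw.append(sum(v * v for v in x) / n)       # n^{-1} sum x_i^2
        improved.append(xbar * xbar + (n - 1) / n)  # Rao-Blackwellized version
    return raw, improved


raw, improved = simulate(mu=1.0, n=5, reps=40_000)
print(fmean(raw), fmean(improved))          # both should be near mu^2 + 1 = 2
print(pvariance(raw), pvariance(improved))  # the second should be the smaller
```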


We have now for vector estimation an analog of the Rao-Blackwell
theorem.

THEOREM 2.7. (BLACKWELL-LEHMANN-SCHEFFÉ). If $t(x)$ is a sufficient
statistic for $\{P_\theta \mid \theta \in \Omega\}$ and if $f(x)$ is an unbiased estimator of $g(\theta)$, then
$h(t) = E\{f(X) \mid t\}$ is an unbiased estimator based on $t(x)$. With a strictly
convex loss function, the inequality $R_f(\theta) > R_h(\theta)$ holds unless $f(x) = h(t(x))$
almost everywhere $(P_\theta)$, in which case $R_f(\theta) = R_h(\theta)$. The ellipsoid of
concentration for $h$ is contained in the ellipsoid of concentration for $f$ with
equality of ellipsoids only if $f(x) = h(t(x))$ almost everywhere $(P_\theta)$.


Proof. Most of the proof duplicates that for the previous theorem but
uses vectors instead of real numbers for $f$ and $h$. The part needing
special proof concerns the ellipsoids of concentration.

Since $f(x)$ is an unbiased estimator of $g(\theta)$, any linear combination of the
coordinates of $f(x)$, say $\sum l_i f_i$, is an unbiased estimator of the same linear
combination of the coordinates of $g(\theta)$. By Theorem 2.6 we have that

$$E\Big\{\sum l_i f_i(X) \,\Big|\, t\Big\} = \sum l_i E\{f_i(X) \mid t\} = \sum l_i h_i(t)$$

is an unbiased estimator of $\sum l_i g_i(\theta)$ and has smaller variance unless
$\sum l_i f_i(X) = \sum l_i h_i(t(X))$ with probability one, in which case the variances
are equal.

Let $f(x)$ and $h(t)$ have, respectively, the covariance matrices $\|\sigma^{(f)}_{ij}(\theta)\|$ and
$\|\sigma^{(h)}_{ij}(\theta)\|$. Now, since the variance of $\sum l_i f_i(X)$ is $\sum l_i l_j \sigma^{(f)}_{ij}(\theta)$ and of
$\sum l_i h_i(t)$ is $\sum l_i l_j \sigma^{(h)}_{ij}(\theta)$, we can write the variance inequality

$$\sum l_i l_j \, \sigma^{(f)}_{ij}(\theta) \ge \sum l_i l_j \, \sigma^{(h)}_{ij}(\theta).$$

This inequality holds for all $l_1, \ldots, l_k$. Then, assuming that the matrices
are positive definite and applying Theorem 2.5, we obtain

$$\sum l_i l_j \, \sigma_{(f)}^{ij}(\theta) \le \sum l_i l_j \, \sigma_{(h)}^{ij}(\theta)$$

for all $l_1, \ldots, l_k$. Now, for the $l_1, \ldots, l_k$ for which

(2.20) $\sum l_i l_j \, \sigma_{(f)}^{ij}(\theta) = k + 2$

we need $\delta \le 1$ in order that

$$\sum \delta l_i \, \delta l_j \, \sigma_{(h)}^{ij}(\theta) = k + 2.$$

This obviously implies that the ellipsoid (2.20) contains the ellipsoid

$$\sum l_i l_j \, \sigma_{(h)}^{ij}(\theta) = k + 2.$$

The equality of ellipsoids means that the inequalities above should all be
identities, and this means that $f(X) = h(t(X))$ with probability one.

If the matrices are singular, then certain of the linear combinations will
have zero variance, and this implies that all probability is in a linear
subspace. The preceding argument can be applied in such a subspace.
The constant $k + 2$ gives the proper relationship between ellipsoids in one
space and the related ellipsoids in a subspace. The details of this argument
are quite straightforward.


The next theorem we consider helps us to avoid for many problems the 
sometimes tedious calculation of conditional expectations, and it produces 
unbiased estimators with minimum risk (convex loss). First, a definition: 

A parameter $g(\theta)$ for $\{P_\theta \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$ is estimable if there exists
a statistic $f(x)$ such that $E_\theta\{f(X)\} = g(\theta)$ for $\theta \in \Omega$; that is, if there exists
an unbiased estimator for it.

THEOREM 2.8. (LEHMANN-SCHEFFÉ). If there is a complete and sufficient
statistic $t(x)$ for $\{P_\theta \mid \theta \in \Omega\}$, then every estimable real parameter $g(\theta)$
has a unique unbiased estimator with minimum variance and minimum
risk (strictly convex loss); this estimator is the only unbiased estimator
which is a function of $t(x)$.

Proof. From the assumptions of the theorem we know there is at least
one unbiased estimator of $g(\theta)$. By the Rao-Blackwell Theorem 2.6 we
know that, for any unbiased estimator $f(x)$, there is an unbiased estimator
$h(t)$ based on $t(x)$ such that $R_h(\theta) \le R_f(\theta)$ with strict inequality for at
least one $\theta$ unless $f(x)$ can be written as a function of $t(x)$ almost everywhere
for each $\theta$. Therefore, in looking for minimum-risk estimators, we can
restrict our attention to the unbiased estimators based on $t(x)$.

If there are two such estimators, say $h_1(t)$ and $h_2(t)$, then

$$E_\theta\{h_1(T)\} = g(\theta),$$
$$E_\theta\{h_2(T)\} = g(\theta),$$

and therefore

$$E_\theta\{h_1(T) - h_2(T)\} = g(\theta) - g(\theta) = 0.$$

But, from the completeness of the statistic $t(x)$, we have $h_1(t) - h_2(t) = 0$
almost everywhere with respect to each of the measures $P_\theta^T$. Thus an
unbiased estimator based on $t(x)$ is essentially unique and has smaller
variance and risk than any other unbiased estimator.

EXAMPLE 2.4. Let $X_1, \ldots, X_n$ be independent, let each have the normal
distribution with mean $\mu$ and variance 1, and let $\mu \in R^1$. From Chapter 1
we know that $\bar{x}$ is a complete sufficient statistic.

Consider the problem of estimating the real parameters: $\mu$, $\mu^2$,
$E(X^2) = \mu^2 + 1$. All we need is to find an unbiased estimator based on
$\bar{x}$, and it will have uniformly minimum variance and uniformly minimum
risk for any strictly convex loss function.

For $\mu$, $\bar{x}$ itself is the obvious estimator and, of course, the only unbiased
one that is a function of $\bar{x}$.

For $\mu^2$, let us calculate the expectation of $\bar{x}^2$:

$$E_\mu\{\bar{X}^2\} = [E_\mu\{\bar{X}\}]^2 + \sigma_\mu^2\{\bar{X}\} = \mu^2 + \frac{1}{n};$$

therefore

$$E_\mu\Big\{\bar{X}^2 - \frac{1}{n}\Big\} = \mu^2,$$

and $\bar{x}^2 - 1/n$ is the unique minimum variance estimator.

For $E(X^2) = \mu^2 + 1$, we obviously take $\bar{x}^2 + (n - 1)/n$ as the
minimum-risk estimator.
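
The bias correction in Example 2.4 can be illustrated by simulation: the naive statistic $\bar{x}^2$ overestimates $\mu^2$ by about $1/n$, and subtracting $1/n$ removes the bias. The parameter values, seed and function name below are arbitrary assumptions made only for this sketch.

```python
import random
from statistics import fmean


def simulate_xbar_squared(mu, n, reps, seed=1):
    """Return simulated values of xbar^2 for normal samples with mean mu, variance 1."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        values.append(xbar * xbar)
    return values


mu, n = 1.5, 4
xbar_sq = simulate_xbar_squared(mu, n, reps=40_000)

naive = fmean(xbar_sq)                         # estimates mu^2 + 1/n
corrected = fmean(v - 1 / n for v in xbar_sq)  # estimates mu^2
print(naive, corrected, mu ** 2)
```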


For vector parameters we have the following extension of the
Lehmann-Scheffé Theorem 2.8.

THEOREM 2.9. (LEHMANN-SCHEFFÉ). If there is a complete and sufficient
statistic $t(x)$ for $\{P_\theta \mid \theta \in \Omega\}$, then every estimable vector parameter has
a unique unbiased estimator with minimum concentration ellipsoid and
minimum risk (strictly convex loss); this estimator is the only unbiased
estimator that is a function of $t(x)$.


Proof. The proof is a vector analog of that for the previous theorem. 


EXAMPLE 2.5. Let $X_1, \ldots, X_n$ be independent, and let each $X_i$ have
the normal distribution with mean $\mu$ and variance $\sigma^2$. We consider all
distributions corresponding to $\mu \in R^1$ and $\sigma^2 \in\, ]0, \infty[$. By Problems 33
and 39 in Chapter 1 we know that $(\bar{x}, \sum (x_i - \bar{x})^2)$ is a complete sufficient
statistic.

For the estimation of the parameter $(\mu, \sigma^2)$ it is easily seen that the
statistic $\big(\bar{x}, \frac{1}{n - 1} \sum (x_i - \bar{x})^2\big)$ is unbiased. Also it is a function of the
complete sufficient statistic. By the Lehmann-Scheffé Theorem 2.9 it is
the unbiased estimator with minimum risk for any strictly convex loss
function. Also by Theorem 2.9 it is the unbiased estimate with minimum
ellipsoid of concentration.

For the parameter $(\mu, \mu^2 + \sigma^2)$ we look for an unbiased estimator based
on the sufficient statistic. By examining the expectation of $\bar{x}^2$ we find
that the required statistic is

$$\Big(\bar{x},\; \bar{x}^2 + \frac{n - 1}{n} \Big(\frac{1}{n - 1} \sum (x_i - \bar{x})^2\Big)\Big).$$

This estimator has minimum ellipsoid of concentration and minimum risk
(strictly convex loss) among the unbiased estimators.
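
It may be worth noting that the second coordinate of this estimator is algebraically the same as the mean of the squared observations, $n^{-1} \sum x_i^2$, since $\sum x_i^2 = \sum (x_i - \bar{x})^2 + n \bar{x}^2$. The sketch below checks the identity exactly on an arbitrary illustrative sample; the function name is hypothetical.

```python
from fractions import Fraction


def second_coordinate(x):
    """xbar^2 + ((n-1)/n) * s^2, where s^2 = sum (x_i - xbar)^2 / (n - 1)."""
    n = len(x)
    xbar = Fraction(sum(x), n)
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return xbar ** 2 + Fraction(n - 1, n) * s2


sample = [2, 3, 5, 10]
mean_of_squares = Fraction(sum(v * v for v in sample), len(sample))
print(second_coordinate(sample), mean_of_squares)  # both 69/2
```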


> The Lehmann-Scheffé theorem proves under rather special assumptions 
that, if there is an unbiased estimator with uniformly smallest risk, then it is 
essentially unique. However this is true in general on the one assumption that 
the loss function is strictly convex. 

THEOREM 2.10. If $f_1(x)$ and $f_2(x)$ are unbiased estimators of $g(\theta)$ having
uniformly minimum risk (strictly convex loss), then $f_1(x) = f_2(x)$ almost
everywhere $\{P_\theta\}$.

Proof. Let $R(\theta)$ be the minimum value of the risk at $\theta$. Then

$$R(\theta) = E_\theta\{W(f_1(X), \theta)\} = E_\theta\{W(f_2(X), \theta)\}.$$

Since $f_1(x)$, $f_2(x)$ are unbiased, $\alpha f_1(x) + (1 - \alpha) f_2(x)$ is also an unbiased
estimator of $g(\theta)$, and it must have of course risk at least as large as $R(\theta)$; that is,

(2.21) $E_\theta\{W(\alpha f_1(X) + (1 - \alpha) f_2(X), \theta)\} \ge R(\theta)$.

However, with $\alpha \in\, ]0, 1[$ the strict convexity of $W$ gives

(2.22) $E_\theta\{W(\alpha f_1(X) + (1 - \alpha) f_2(X), \theta)\}$
       $\le E_\theta\{\alpha W(f_1(X), \theta) + (1 - \alpha) W(f_2(X), \theta)\}$
       $= \alpha R(\theta) + (1 - \alpha) R(\theta)$
       $= R(\theta)$,

but with equality if and only if

$$W(\alpha f_1(X) + (1 - \alpha) f_2(X), \theta) = \alpha W(f_1(X), \theta) + (1 - \alpha) W(f_2(X), \theta)$$

with probability $(P_\theta)$ one. Again, by the strict convexity, this last condition can
only hold if $f_1(x) = f_2(x)$ with probability $(P_\theta)$ one.

The two inequalities (2.21) and (2.22) together imply that

$$E_\theta\{W(\alpha f_1(X) + (1 - \alpha) f_2(X), \theta)\} = R(\theta),$$

and this equality by the last remark in the paragraph above implies that
$f_1(x) = f_2(x)$ with probability $(P_\theta)$ one. Thus any two minimum risk estimators
must be essentially equivalent.

If a problem has a complete sufficient statistic, the Lehmann-Scheffé theorem
provides a constructive procedure for obtaining minimum-variance, minimum-risk
unbiased estimators so long as the parameter in question has an unbiased
estimator. Even in cases where a sufficient statistic that is complete cannot be
found, the Rao-Blackwell theorem shows how any unbiased estimator can be
improved upon by making it depend directly on a sufficient statistic. If the
statistician is willing to accept squared error for the loss and variance for the
risk then it remains possible under weaker assumptions to describe those
statistics that are minimum-variance unbiased estimators of real parameters.
For this we need to define a class of unbiased estimators of zero.

The class of unbiased estimators of zero for $\{P_\theta \mid \theta \in \Omega\}$ based on a sufficient
statistic $t(x)$ is the class $\nu_0$ of statistics:

$$\nu_0 = \{f(t) \mid E_\theta\{f(T)\} = 0, \ \theta \in \Omega\}.$$

A related class which is sometimes the class of statistics that are minimum-variance
unbiased estimators is

$$\nu = \Big\{h(t) \;\Big|\; E_\theta\{h(T)\} \text{ exists for } \theta \in \Omega; \ E_\theta\{h(T) f(T)\} = 0 \text{ for } f(t) \in \nu_0 \text{ and each } \theta \text{ for which } \sigma^2_{h(T)}(\theta) < \infty \Big\}.$$

$\nu$ consists of all those statistics having finite expectation and, when their variance
is finite, zero covariance with all statistics in $\nu_0$. The following is an extension
of a theorem by Lehmann and Scheffé.

THEOREM 2.11. If all the statistics in $\nu_0$ have finite variance, then a statistic
is a minimum-variance estimator of its expected value if and only if it belongs
to $\nu$.


Proof. By the Rao-Blackwell theorem we restrict attention to estimators
based on the sufficient statistic $t(x)$.

Let $h(t)$ be a minimum-variance unbiased estimator of $g(\theta)$. If $f(t)$ belongs to
$\nu_0$, then $h(t) + \lambda f(t)$ is also an unbiased estimator of $g(\theta)$,

$$E_\theta\{h(T) + \lambda f(T)\} = E_\theta\{h(T)\} + \lambda E_\theta\{f(T)\} = g(\theta) + \lambda \cdot 0 = g(\theta),$$

and must have variance no smaller than that of the minimum-variance estimator
$h(t)$. Therefore

$$\sigma^2_{h(T) + \lambda f(T)}(\theta) = \sigma^2_{h(T)}(\theta) + \lambda^2 \sigma^2_{f(T)}(\theta) + 2 \lambda E_\theta\{h(T) f(T)\} \ge \sigma^2_{h(T)}(\theta).$$

If $\sigma^2_{h(T)}(\theta)$ is finite, then the above inequality becomes

$$\lambda^2 \sigma^2_{f(T)}(\theta) + 2 \lambda E_\theta\{h(T) f(T)\} \ge 0,$$

and holds for all $\lambda$. By taking $\lambda$ small, the first term becomes negligible with
respect to the second, and, by then changing the sign of $\lambda$, the inequality would
be reversed producing a contradiction, unless as must be the case the second
term is zero. Thus $E_\theta\{h(T) f(T)\} = 0$ whenever $\sigma^2_{h(T)}(\theta)$ is finite, and therefore
$h(t)$ belongs to $\nu$.

Now suppose $h(t)$ belongs to $\nu$. Let $h^*(t)$ be any other unbiased estimator of
$g(\theta) = E_\theta\{h(T)\}$. Then by subtraction $h^*(t) - h(t)$ is an unbiased estimator
of zero, say $f(t)$, and belongs to $\nu_0$. Then, using the definition of $\nu$, we have

$$\begin{aligned}
\sigma^2_{h^*(T)}(\theta) &= \sigma^2_{h(T) + f(T)}(\theta) \\
 &= \sigma^2_{h(T)}(\theta) && \text{if } \sigma^2_{h(T)}(\theta) = \infty \\
 &= \sigma^2_{h(T)}(\theta) + \sigma^2_{f(T)}(\theta) && \text{if } \sigma^2_{h(T)}(\theta) < \infty.
\end{aligned}$$

In either case we have $\sigma^2_{h^*(T)}(\theta) \ge \sigma^2_{h(T)}(\theta)$, which means that $h(t)$ is a
minimum-variance unbiased estimator of its expected value.


2.3. Invariant Estimation. In the previous section the requirement of
unbiasedness was used to restrict the class of estimators in the hope of
finding a good estimator, say with minimum risk, in the smaller class.
Another property that can be used is invariance, and we discuss it briefly
in this section.


In general terms the method of invariance is based on the following ideas.
Suppose for an experiment that a statistician knows the probability
measure is in the class $\{P_\theta \mid \theta \in \Omega\}$ of measures over $\mathcal{X}(\mathcal{A})$ and that he is
interested in estimating the parameter $g(\theta)$. Also suppose that he knows
of a transformation $sx$ which maps $\mathcal{X}$ into $\mathcal{X}$ in such a way that the class
of induced measures for $sX$ is exactly the class $\{P_\theta \mid \theta \in \Omega\}$. Then the
transformation $s$ is called invariant in the sense that it leaves the probability
structure of the problem unchanged. Of course, if the transformation
is applied to the outcome of the experiment, the new probability measure
in general will not be the same as the original measure. If the statistician
has decided on an estimator for the problem, then there are two courses
open to him. He can use his estimator with the outcome of the experiment
and obtain an estimate of the parameter $g(\theta)$. Or he can apply the
transformation to the outcome, and then, since the new problem is formally the
same as the old, he can use his estimator to obtain an estimate of the new
value of the parameter. His estimate for the original parameter is then the
value which corresponds to his estimated value of the new parameter.
The estimator is called an invariant estimator if these two estimates are
equal. The restriction to estimators having this property is called the
invariance principle.

We can describe these ideas concisely by means of a simple example. Suppose
that a length is being measured. If an estimate is obtained for the
measurements expressed in inches and if an estimate is obtained for the
measurements expressed in centimeters, then the two estimates should
correspond.

We now formalize these ideas. Let the class of probability measures be
$\{P_\theta \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$, and suppose that we have a class $\mathcal{G}$ of
transformations $sx$ which map $\mathcal{X}$ into itself. We shall call $\mathcal{G}$ an invariant
class of transformations if it satisfies the following modest restrictions:

(1) $\mathcal{G}$ is a group; that is, it satisfies:

(a) If $s_1, s_2 \in \mathcal{G}$, then the product transformation $s_1 s_2 \in \mathcal{G}$.
(b) If $s \in \mathcal{G}$, then the inverse transformation $s^{-1} \in \mathcal{G}$.

(2) The class of measures $\{P_\theta \mid \theta \in \Omega\}$ is closed under $\mathcal{G}$; that is, if $X$
has the probability measure $P_\theta$ ($\theta \in \Omega$), then $sX$ for $s \in \mathcal{G}$ has the
probability measure $P_{\bar{s}\theta}$, where $\bar{s}\theta \in \Omega$.

The second restriction has the following interpretation. If a transformation
$s$ is applied to the outcome of an experiment, then the measures that
describe the transformed outcome should be ones included in the original
class of measures. Thus in a certain sense the application of a
transformation in $\mathcal{G}$ does not alter the problem but leaves it "invariant." The
first restriction is to insure that the inverse of each transformation is in $\mathcal{G}$
and that, if we apply two transformations successively, then the composite
transformation is also in the class $\mathcal{G}$.

It is to be noted that, for each transformation $s \in \mathcal{G}$, there is a
corresponding transformation $\bar{s}$ which maps $\Omega$ into $\Omega$. Problem 18 is to prove
that each $\bar{s}$ maps $\Omega$ onto $\Omega$ in the form of a one-to-one correspondence.
Also it is quite easy to prove that the class $\bar{\mathcal{G}}$ of transformations $\bar{s}$ is a
group. See Problem 19.

We have a class $\mathcal{G}$ of transformations which leave the probability model
unchanged. Consider now the estimation of a real parameter $g(\theta)$. To
apply the methods of invariance it is necessary that the class $\mathcal{G}$ be restricted
so that it leaves the structure of the parameter unchanged; we impose the
further restriction:

(3) For each $s \in \mathcal{G}$, $g(\theta) = g(\theta')$ implies that $g(\bar{s}\theta) = g(\bar{s}\theta')$ for all
$\theta, \theta' \in \Omega$.

This means that, when a transformation $s$ is applied to the outcome, a
value of the parameter $g(\theta)$ is also transformed, and the new value does not
depend on which $\theta$ corresponded to the original value of $g(\theta)$. Thus a
transformation $s$ on $\mathcal{X}$ or the corresponding transformation $\bar{s}$ on $\Omega$ induces
a transformation on the values of the parameter $g(\theta)$. Designating this
transformation by $\tilde{s}$, we have the defining equation

\[ \tilde{s}\, g(\theta) = g(\bar{s}\theta). \]

This means that, if $g(\theta)$ is the parameter value for $X$, then $\tilde{s}\, g(\theta)$ is the
parameter value for $sX$. If condition (3) is fulfilled, we say that the class
$\mathcal{G}$ is invariant for the parameter $g(\theta)$.

The invariance principle for the estimation of $g(\theta)$ is to confine attention
to the invariant estimators:

$f(x)$ is an invariant estimator for $g(\theta)$ if

\[ \tilde{s} f(x) = f(sx) \]

for all $s \in \mathcal{G}$ and all $x \in \mathcal{X}$.
The interpretation is that, if a transformation in $\mathcal{G}$ changes the parameter
values, then the values of the estimate should be changed in exactly the
same manner.

For the estimation of the parameter $g(\theta)$ we suppose then that the
statistician restricts his attention to the estimators which are invariant
for $g(\theta)$ and looks for one having some optimum property such as uniformly
minimum risk. If there is a loss function which is natural to a problem, it
is usual to impose a further restriction on the class of transformations $\mathcal{G}$;
viz., that the loss for a decision in the untransformed problem should be the
same as the loss for the corresponding decision in the transformed problem.
We therefore introduce one further restriction:

(4) For each $s \in \mathcal{G}$,

\[ W(f, \theta) = W(\tilde{s} f, \bar{s}\theta) \]

for all $\theta \in \Omega$, $f \in R^1$.
A loss function satisfying this requirement for a given group is called an
invariant loss function.


EXAMPLE 2.6. Let the random variables $Y_1, \dots, Y_n$ be defined by the
equations,

\[ Y_1 = \alpha + \beta x_1 + U_1, \quad \dots, \quad Y_n = \alpha + \beta x_n + U_n, \]

where $U_1, \dots, U_n$ are independent random variables, each with the uniform
distribution over the interval $[-\tfrac12, +\tfrac12]$. The class of
distributions corresponds to all values of $(\alpha, \beta) \in R^2$, and the numbers
$x_1, \dots, x_n$ are given constants of the problem. This is the probability
model for the simple regression problem, but with the one restriction that the
"errors" have a uniform distribution with known range instead of a
normal distribution. Consider the estimation of the
parameter $(\alpha, \beta)$.

We can assume that $\sum x_i = 0$. For otherwise we write

\[ \alpha + \beta x_i = \alpha + \beta\bar{x} + \beta(x_i - \bar{x}) = \alpha' + \beta'(x_i - \bar{x}), \]

and, using the simply transformed parameters $\alpha' = \alpha + \beta\bar{x}$ and $\beta' = \beta$, the vector
corresponding to $\beta'$, $(x_1 - \bar{x}, \dots, x_n - \bar{x})$, has the sum of its coordinates
equal to zero.


Consider the group of transformations

\[ \mathcal{G} = \{\, s y_i = y_i + a_1 + b_1 x_i \ (i = 1, \dots, n) \mid (a_1, b_1) \in R^2 \,\}. \]

This class of transformations satisfies our requirements. First, the class is
a group (actually a commutative group). Second, each probability
distribution is transformed by an element of $\mathcal{G}$ into another of the distributions
for the problem. In fact, we have the following induced group of
transformations on the parameter space $R^2$ of $(\alpha, \beta)$:

\[ \bar{\mathcal{G}} = \{\, \bar{s}\alpha = \alpha + a_1,\ \bar{s}\beta = \beta + b_1 \mid (a_1, b_1) \in R^2 \,\}. \]

A statistic for the estimation of the parameter $(\alpha, \beta)$ will be a pair of
real-valued functions, $(f(y_1, \dots, y_n), g(y_1, \dots, y_n))$. We shall consider the
application of the invariance method to this estimation problem. According
to the theory above, we want the estimate $(f, g)$ to be transformed by an
element of the group $\mathcal{G}$ in exactly the same way as the parameter $(\alpha, \beta)$
being estimated is transformed. We therefore have the following conditions
on $(f, g)$:

\[ f(sy) = \tilde{s} f(y), \qquad g(sy) = \tilde{s}\, g(y). \]

Substituting a typical transformation $s$, we obtain the equations

\[ f(y_1 + a_1 + b_1 x_1, \dots, y_n + a_1 + b_1 x_n) = f(y_1, \dots, y_n) + a_1, \]
\[ g(y_1 + a_1 + b_1 x_1, \dots, y_n + a_1 + b_1 x_n) = g(y_1, \dots, y_n) + b_1. \]

The invariance method is to restrict our attention to the estimators
satisfying this requirement and to look for one for which the risk
function is a minimum.
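As a rough numerical illustration of the two invariance conditions for this regression group, the sketch below checks them for the ordinary least-squares pair. Least squares is used here only because it is simple to compute; it is not claimed to be the minimum-risk invariant estimator. The function names, the regressor values, and the shift (a1, b1) are assumptions made for the example, and the regressors are chosen to sum to zero.

```python
import random

def least_squares(x, y):
    """Least-squares estimates (f, g) of (alpha, beta); assumes sum(x) == 0."""
    n = len(x)
    f = sum(y) / n
    g = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return f, g

def shift(x, y, a1, b1):
    """Apply the transformation s: y_i -> y_i + a1 + b1 * x_i."""
    return [yi + a1 + b1 * xi for xi, yi in zip(x, y)]

rng = random.Random(0)                      # fixed seed so the run is repeatable
x = [-2.0, -1.0, 0.0, 1.0, 2.0]             # hypothetical regressors, sum to zero
alpha, beta = 1.5, -0.7                     # hypothetical true parameter values
y = [alpha + beta * xi + rng.uniform(-0.5, 0.5) for xi in x]

a1, b1 = 3.0, 0.25                          # hypothetical element of the group
f0, g0 = least_squares(x, y)
f1, g1 = least_squares(x, shift(x, y, a1, b1))

# Invariance requires f(sy) = f(y) + a1 and g(sy) = g(y) + b1.
print(abs(f1 - (f0 + a1)) < 1e-9, abs(g1 - (g0 + b1)) < 1e-9)
```

Both differences vanish up to rounding, because the sum of the regressors is zero.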


A reasonable loss function for this problem might be

\[ W(f, g; \alpha, \beta) = p(f - \alpha)^2 + q(g - \beta)^2, \]

where we take $p, q > 0$. This represents a weighting of the squared error
for each of $\alpha$ and $\beta$. The estimator which minimizes the risk corresponding
to this loss function is of the following form: For a given outcome
$(y_1, \dots, y_n)$ there is only a certain set of values of $(\alpha, \beta)$ for which the
probability density at $(y_1, \dots, y_n)$ is positive; the value
of the estimator for this outcome is the center of gravity of this set of
"possible" values for $(\alpha, \beta)$. The derivation of this result is given as
Problem 20 at the end of the chapter. See [9].

It is perhaps interesting to show that the loss function introduced above
is an invariant loss function:

\[ W(\tilde{s} f, \tilde{s} g; \bar{s}\alpha, \bar{s}\beta)
   = p(\tilde{s} f - \bar{s}\alpha)^2 + q(\tilde{s} g - \bar{s}\beta)^2
   = p(f - \alpha)^2 + q(g - \beta)^2
   = W(f, g; \alpha, \beta). \]

3. THE THEORY OF HYPOTHESIS TESTING 


3.1. Introduction. Suppose for a given experiment the statistician has
decided on the class of distributions $\{P_\theta \mid \theta \in \Omega\}$ over the space $\mathcal{X}(\mathcal{A})$.
Then the statistical problem remaining is what we call a hypothesis testing
problem if there are only two decisions which can be made at the completion
of the experiment: the decision that the parameter value $\theta$ which produced
the outcome is in a subset of $\Omega$, or the decision that it is in the complement
of that subset. Also, there is often an asymmetry inherent in the problem
whereby one of the subsets represents the situation found in similar
problems in the past, the status quo, while the complement represents some
new situation that may be present. It is for this reason that the usual
method of treating these problems is also asymmetric. The object is of
course to make the decision appropriate to the situation, that is, to make a
correct decision as to which set the underlying probability measure is in.

An example of a hypothesis testing problem is in Example 1.2 in Section 1
of this chapter, page 43. There, the asymmetry in the method of treatment
is in the restriction to decision functions in the class $\mathcal{D}_\alpha$.

We designate by $\omega$ the subset of values in $\Omega$ which correspond to the
probability measures of the "standard" situation and by $\Omega - \omega$ the
complementary set which corresponds to the measures of the new situation.
We refer to these as the hypothesis: $\theta \in \omega$, and the alternative hypothesis:
$\theta \in \Omega - \omega$. This latter term will usually be abbreviated to alternative.
The two decisions, one of which the statistician must make on the completion
of the experiment, are $d_1$, the decision to accept the hypothesis and say
that $\theta$ belongs to $\omega$, and $d_2$, the decision to accept the alternative and say
that $\theta$ belongs to $\Omega - \omega$.

Because there are only two decisions, the loss function has a simplified
form. Let

\[ W(d_1, \theta) = W_1(\theta) \]

and

\[ W(d_2, \theta) = W_2(\theta). \]

If the $\theta$ of the underlying probability measure is in $\omega$, then a decision $d_1$ to
accept the hypothesis is correct, and we usually require the loss to be zero:

\[ W_1(\theta) = 0 \]

for $\theta \in \omega$. Similarly, for $\theta$ in $\Omega - \omega$, the decision $d_2$ to accept the
alternative is correct, and we usually have

\[ W_2(\theta) = 0 \]

for $\theta \in \Omega - \omega$. It follows that we can further simplify the notation and
designate by $W(\theta)$ the loss resulting from an incorrect decision; then we
have

\[ W(\theta) = W_1(\theta) + W_2(\theta). \]


A decision function also has a simplified form. For each outcome
$x \in \mathcal{X}$ there will be associated either $d_1$ or $d_2$. Consequently a decision
function can be represented by a subset of $\mathcal{X}$ which is called the critical
region, and consists of those points $x$ that result in the decision $d_2$ to accept
the alternative. These are the points for which the statistician makes the
decision that the probability measure is one of those representing the "new"
situation.

In hypothesis testing the randomized decisions play a very important
role. Since there are only two decisions to which a randomized decision
can assign probability, it suffices to give the probability for one of them
which by tradition is $d_2$, the decision to accept the alternative. Accordingly,
we describe a randomized decision function by means of a real-valued
statistic $\phi(x)$, which is called the test function and is defined over $\mathcal{X}$. We
require $\phi(x)$ to satisfy $0 \le \phi(x) \le 1$. For a given outcome $x$, $\phi(x)$ is taken
to be the probability with which the statistician accepts the alternative, and
then $1 - \phi(x)$ is the probability with which he accepts the hypothesis.
The test function corresponding to a nonrandomized decision takes the
value 1 on the critical region and the value 0 elsewhere.
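The way a test function is turned into an actual decision can be sketched in a few lines. This is only an illustration: the function name `decide` and the use of a seeded pseudo-random generator are assumptions of the sketch, not part of the text.

```python
import random

def decide(phi_x, rng):
    """Accept the alternative with probability phi_x, else the hypothesis."""
    if not 0.0 <= phi_x <= 1.0:
        raise ValueError("a test function must take values in [0, 1]")
    # rng.random() is uniform on [0, 1), so phi_x = 1 always gives the
    # alternative and phi_x = 0 always gives the hypothesis.
    return "alternative" if rng.random() < phi_x else "hypothesis"

rng = random.Random(0)
draws = [decide(0.3, rng) for _ in range(10000)]
share = draws.count("alternative") / len(draws)   # should be close to 0.3
print(decide(1.0, rng), decide(0.0, rng), round(share, 2))
```

A nonrandomized test is the special case in which `phi_x` is always 0 or 1, so the random draw never matters.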

In the examples of Section 1 the term operating characteristic was
introduced for the function that gave the probability for each decision in
each situation. Since there are only two decisions, the operating
characteristic of a test function $\phi(x)$ can be described by a single function
$P_\phi(\theta)$, called the power function. $P_\phi(\theta)$ gives the probability of accepting
the alternative when the parameter is $\theta$. Since $\phi(x)$ is the conditional
probability of accepting the alternative given the value $x$, then

\[ P_\phi(\theta) = E_\theta\{\phi(X)\} = \int_{\mathcal{X}} \phi(x)\, dP_\theta(x). \]

We can obtain the risk function directly from the power function; we
have

\[ R_\phi(\theta) = \begin{cases}
     W_2(\theta)\, P_\phi(\theta) & \text{for } \theta \in \omega, \\
     W_1(\theta)\,(1 - P_\phi(\theta)) & \text{for } \theta \in \Omega - \omega.
   \end{cases} \]

The risk function is just the power function or its complement weighted at
each value of $\theta$. Therefore it should not be surprising that much of the
theory of hypothesis testing is based directly on the power function, and in
fact for many of the standard problems a loss function is not even
considered.
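To make the relation between test function, power function, and risk concrete, here is a small sketch for a discrete model. The binomial model, the particular randomized test `phi`, and the constant loss `W` are all assumptions chosen for illustration; none of them comes from the text.

```python
from math import comb

def power(phi, n, theta):
    """P_phi(theta) = E_theta{phi(X)} for X binomial(n, theta)."""
    return sum(
        phi(x) * comb(n, x) * theta**x * (1.0 - theta) ** (n - x)
        for x in range(n + 1)
    )

def risk(phi, n, theta, in_hypothesis, W=1.0):
    """Risk = W * power under the hypothesis, W * (1 - power) under the alternative."""
    p = power(phi, n, theta)
    return W * p if in_hypothesis else W * (1.0 - p)

def phi(x):
    """Hypothetical randomized test for n = 5 observations."""
    if x > 4:
        return 1.0
    if x == 4:
        return 0.5
    return 0.0

p_half = power(phi, 5, 0.5)   # 1/32 + 0.5 * 5/32 = 0.109375
print(p_half, risk(phi, 5, 0.5, True), risk(phi, 5, 0.9, False))
```

Only the power function is needed to evaluate the risk at any parameter value, which is the point made in the paragraph above.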

In hypothesis testing the class of decision or test functions is restricted
to those that give the statistician protection in the "standard" situation
represented by the hypothesis. This protection takes the form of a
bound on the probability of an incorrect decision. To formulate this we
need the definition:

A test function is of size $\alpha$ if

\[ (3.1) \qquad P_\phi(\theta) = \int_{\mathcal{X}} \phi(x)\, dP_\theta(x) \le \alpha \]

for all $\theta \in \omega$.

For $\theta$ belonging to $\omega$ the power function $P_\phi(\theta)$ is the probability of an
incorrect decision; therefore the definition means that, when the underlying
probability measure is represented by the hypothesis, the test makes a
wrong decision with probability no more than $\alpha$. The statistician will
examine his experimental situation, and choose a value for $\alpha$ (often 0.10,
0.05, or 0.01) to give the protection he desires should the parameter value
in his experiment be one of those of the "standard" situation. He then
restricts his class of test functions $\mathcal{D}_\alpha$ to those of size $\alpha$. For some later
results in this section it is convenient to have a somewhat more restrictive
definition:


A test function is of exact size $\alpha$ if

\[ P_\phi(\theta) = \int_{\mathcal{X}} \phi(x)\, dP_\theta(x) \le \alpha \]

for all $\theta \in \omega$ and if

\[ P_\phi(\theta) = \int_{\mathcal{X}} \phi(x)\, dP_\theta(x) = \alpha \]

for at least one $\theta$, or more generally if

\[ \sup_{\theta \in \omega} P_\phi(\theta) = \alpha. \]


In choosing a test function in $\mathcal{D}_\alpha$ the statistician wishes to minimize the
risk. For $\theta$ belonging to $\omega$, he has already obtained protection by the
restriction on the tests in $\mathcal{D}_\alpha$. Therefore he only examines the risk
function for those values of $\theta$ in $\Omega - \omega$, and for these

\[ R_\phi(\theta) = W(\theta)\,(1 - P_\phi(\theta)). \]

To find a test with minimum risk is clearly to find a test with maximum
power. It is worth pointing out that the function $P_\phi(\theta)$ is named the power
function because, for $\theta$ belonging to the alternative, $P_\phi(\theta)$ is the probability
of a correct decision: power in the sense of ability to detect a probability
measure belonging to the alternative.

For some of the simpler problems it happens that the test function having
maximum power for one value of $\theta$ in $\Omega - \omega$ also has maximum power for
every other $\theta$. Such a test function is called a uniformly most powerful
test function of size $\alpha$; this is frequently abbreviated to most powerful test
of size $\alpha$. Among the tests in $\mathcal{D}_\alpha$ such a test has uniformly smallest risk,
regardless of the loss function, provided of course we accept the size
condition as giving the protection under the hypothesis and we examine
the risk function only for those $\theta$ belonging to the alternative. It would
seem then that a reasonable first step toward getting most powerful test
functions is to find a procedure for obtaining a size $\alpha$ test that has maximum
power for a particular $\theta$ in the alternative. This we shall do, but first we
give a verbal picture of the search for a test. The statistician finds all the
test functions of size $\alpha$ and puts them together to form the class $\mathcal{D}_\alpha$. With
each test $\phi(x)$ he associates a collection of real numbers $(P_\phi(\theta'), P_\phi(\theta''),
P_\phi(\theta'''), \dots)$ which gives the power or performance of that test for the
different situations $\theta', \theta'', \theta''', \dots$ of the alternative. For a particular $\theta$ he can
examine the class $\mathcal{D}_\alpha$ and pick out the test having the maximum power for
that $\theta$ (see Problem 21). He could repeat this for another $\theta$, say $\theta'$, and
he would be surprised and lucky if the same test produced the maximum
power at $\theta'$.


3.2. The Fundamental Lemma. If there is only one probability measure
in the hypothesis, that is, one $\theta$ in $\omega$, we speak of a simple hypothesis, and
if there is more than one we speak of a composite hypothesis. Similarly,
$\Omega - \omega$ can be a simple or a composite alternative hypothesis. The theorem
we now consider produces a most powerful test for any problem having
a simple hypothesis and a simple alternative.


THEOREM 3.1 (NEYMAN-PEARSON). THE FUNDAMENTAL LEMMA. For
testing the

\[ \text{Hypothesis:} \quad P(A) = \int_A f(x)\, d\mu(x), \]

against the

\[ \text{Alternative:} \quad P'(A) = \int_A g(x)\, d\mu(x), \]

a most powerful size-$\alpha$ test exists and has the form

\[ (3.2) \qquad \phi(x) = \begin{cases}
     1 & \text{if } g(x)/f(x) > c, \\
     a & \text{if } g(x)/f(x) = c, \\
     0 & \text{if } g(x)/f(x) < c,
   \end{cases} \]

where $c$ and $a$ are constants chosen to make the test have exact size $\alpha$,

\[ (3.3) \qquad \int_{\mathcal{X}} \phi(x)\, dP(x) = \alpha. \]

Any two measures, $P(A)$ and $P'(A)$, will satisfy the requirements of the
theorem. Problem 23 is to show that, for any two measures $P$, $P'$, there
exists a dominating measure $\mu(A)$: $P(A) \ll \mu(A)$, $P'(A) \ll \mu(A)$.
The Radon-Nikodym theorem then supplies the probability density
functions, $f(x)$ and $g(x)$.


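Proof of Theorem 3.1. We are looking among the test functions that
satisfy

\[ (3.4) \qquad \int_{\mathcal{X}} \phi(x) f(x)\, d\mu(x) \le \alpha \]

for one that gives

\[ \int_{\mathcal{X}} \phi(x) g(x)\, d\mu(x) \]

its maximum value.

We first show that a test of the form (3.2) can be found to satisfy the
requirement (3.3). $g(x)/f(x)$ is a function taking real values (or $+\infty$ if
the denominator is zero) and defined over $\mathcal{X}$. Corresponding to the
measure $P(A)$ of the hypothesis, it has an induced distribution on the real
line, and this distribution is restricted to the positive axis (and possibly
$+\infty$). Let $c$ be a number such that

\[ \Pr\left\{ \frac{g(X)}{f(X)} > c \right\} \le \alpha \le \Pr\left\{ \frac{g(X)}{f(X)} \ge c \right\}. \]

If we use the percentile symbol defined in Section 2.2, then $c$ can be written
as the $(1 - \alpha)$ percentile of the hypothesis distribution of $g(X)/f(X)$.

Figure 9. The hypothesis distribution of $g(X)/f(X)$.

Figure 9 illustrates the derivation of this value $c$. Problem 24 is to prove
that, for $X$ having the hypothesis distribution, $g(X)/f(X)$ does not have
probability at $+\infty$, and hence that $c < +\infty$. From the definition of $c$ we
have

\[ 0 \le \alpha - \Pr\left\{ \frac{g(X)}{f(X)} > c \right\} \le \Pr\left\{ \frac{g(X)}{f(X)} = c \right\}. \]

Consequently there will always exist a number $a$ between 0 and 1 such that

\[ \Pr\left\{ \frac{g(X)}{f(X)} > c \right\} + a \Pr\left\{ \frac{g(X)}{f(X)} = c \right\} = \alpha. \]

But this last equation demonstrates that (3.3) is satisfied:

\[ \int_{\mathcal{X}} \phi(x)\, dP(x) = \int_{\mathcal{X}} \phi(x) f(x)\, d\mu(x)
   = 1 \cdot \Pr\left\{ \frac{g(X)}{f(X)} > c \right\} + a \cdot \Pr\left\{ \frac{g(X)}{f(X)} = c \right\} + 0 \cdot \Pr\left\{ \frac{g(X)}{f(X)} < c \right\}
   = \alpha. \]

We now show that the test $\phi(x)$ satisfying (3.2) is at least as powerful as
any other test, say $\phi^*(x)$, of size $\alpha$. We divide $\mathcal{X}$ into three disjoint sets
$\mathcal{X}^+$, $\mathcal{X}^0$, $\mathcal{X}^-$:

\[ \mathcal{X}^+ = \left\{ x \,\Big|\, \frac{g(x)}{f(x)} > c \right\}, \quad
   \mathcal{X}^0 = \left\{ x \,\Big|\, \frac{g(x)}{f(x)} = c \right\}, \quad
   \mathcal{X}^- = \left\{ x \,\Big|\, \frac{g(x)}{f(x)} < c \right\}. \]

On these sets, respectively, $g(x) > c f(x)$, $= c f(x)$, and $< c f(x)$. We
now compare the powers

\[ P_\phi - P_{\phi^*} = \int_{\mathcal{X}} \phi\, g\, d\mu - \int_{\mathcal{X}} \phi^* g\, d\mu
   = \int_{\mathcal{X}} (\phi - \phi^*)\, g\, d\mu
   = \int_{\mathcal{X}^+} (\phi - \phi^*)\, g\, d\mu + \int_{\mathcal{X}^0} (\phi - \phi^*)\, g\, d\mu + \int_{\mathcal{X}^-} (\phi - \phi^*)\, g\, d\mu. \]

On $\mathcal{X}^+$, $\phi(x)$ is equal to one; hence $\phi - \phi^*$ is positive or zero. On
$\mathcal{X}^-$, $\phi(x)$ is equal to zero; hence $\phi - \phi^*$ is negative or zero. Then,
noting the sign of the integrand for each term and using the relative
magnitudes of $g(x)$ and $c f(x)$ on the three sets, we obtain

\[ \begin{aligned}
   P_\phi - P_{\phi^*} &\ge \int_{\mathcal{X}^+} (\phi - \phi^*)\, c f\, d\mu + \int_{\mathcal{X}^0} (\phi - \phi^*)\, c f\, d\mu + \int_{\mathcal{X}^-} (\phi - \phi^*)\, c f\, d\mu \\
   &= c \int_{\mathcal{X}} (\phi - \phi^*) f\, d\mu \\
   &= c \left[ \int_{\mathcal{X}} \phi f\, d\mu - \int_{\mathcal{X}} \phi^* f\, d\mu \right] \\
   &= c \left[ \alpha - \int_{\mathcal{X}} \phi^* f\, d\mu \right] \\
   &\ge 0,
   \end{aligned} \]

where the last inequality follows from the fact that $\phi^*$ satisfies (3.4), so that
the bracket is positive or zero, and that $c$ is nonnegative.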
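The construction in the proof can be tried out on a small finite sample space. The sketch below builds a test of the form (3.2) with exact size and compares its power with every nonrandomized test of the same size bound. The six-point densities, the value of alpha, and all function names are hypothetical, and the hypothesis density is assumed positive everywhere so that the ratio is finite.

```python
from itertools import product

# Hypothetical densities on a six-point sample space (illustration only).
f = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]   # hypothesis density f(x)
g = [0.05, 0.10, 0.15, 0.20, 0.20, 0.30]   # alternative density g(x)
alpha = 0.08

def neyman_pearson(f, g, alpha):
    """Return a test function of form (3.2) with exact size alpha (needs f > 0)."""
    ratio = [gi / fi for fi, gi in zip(f, g)]
    for c in sorted(set(ratio), reverse=True):
        above = sum(fi for fi, r in zip(f, ratio) if r > c)   # Pr{g/f > c}
        at = sum(fi for fi, r in zip(f, ratio) if r == c)     # Pr{g/f = c}
        if above <= alpha <= above + at:
            a = (alpha - above) / at
            return [1.0 if r > c else (a if r == c else 0.0) for r in ratio]
    raise ValueError("alpha must lie between 0 and 1")

def size(phi):
    return sum(p * fi for p, fi in zip(phi, f))

def power(phi):
    return sum(p * gi for p, gi in zip(phi, g))

phi_np = neyman_pearson(f, g, alpha)

# Best power among the 2**6 nonrandomized tests whose size does not exceed alpha.
best_other = max(
    power(phi) for phi in product([0.0, 1.0], repeat=len(f))
    if size(phi) <= alpha + 1e-12
)
print(size(phi_np), power(phi_np), best_other)
```

In this toy case the randomized test of form (3.2) uses the whole size allowance and has strictly larger power than the best nonrandomized competitor.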
EXAMPLE 3.1. Let $X_1, \dots, X_n$ be independent, and let each have the
normal distribution with mean $\xi$ and variance 1. Consider the simple
hypothesis and alternative:

Hypothesis: $\xi = 0$,

Alternative: $\xi = \xi'$ $(> 0)$.

We shall use the fundamental lemma to find a most powerful test of size $\alpha$;
that is, to find among test functions $\phi(x)$ satisfying

\[ \text{(i)} \qquad \int \phi(x)\, k \exp\left[ -\tfrac12 \sum x_i^2 \right] \prod dx_i \le \alpha \]

one that maximizes

\[ \text{(ii)} \qquad \int \phi(x)\, k \exp\left[ -\tfrac12 \sum (x_i - \xi')^2 \right] \prod dx_i, \]

where $k = (2\pi)^{-n/2}$. By the lemma, the most powerful test is

\[ \phi(x) = \begin{cases}
     1 & \text{if } \dfrac{k \exp[-\tfrac12 \sum (x_i - \xi')^2]}{k \exp[-\tfrac12 \sum x_i^2]} > c, \\
     0 & \text{if the ratio is} < c,
   \end{cases} \]

where it is unnecessary to consider the points corresponding to the equality
with $c$, since, as we shall see, they have probability zero. We have
$\phi(x) = 1$ (and $\phi(x) = 0$ when the inequality is reversed)

\[ \begin{aligned}
   &\text{if } \frac{\exp[-\tfrac12 \sum x_i^2 + \xi' \sum x_i - \tfrac12 n \xi'^2]}{\exp[-\tfrac12 \sum x_i^2]} > c, \\
   &\text{if } \exp\left[ \xi' \sum x_i \right] > c, \\
   &\text{if } \sum x_i > c, \\
   &\text{if } \bar{x} > c,
   \end{aligned} \]

where $c$ stands for a constant with respect to $x_1, \dots, x_n$, and it may be
different in value from line to line in the relations above. The only
restriction on $c$ is that at some stage it be chosen to give the test function
exact size $\alpha$ under the distribution of the hypothesis.

The test procedure is to accept the alternative if $\bar{x} > c$ and to accept the
hypothesis if $\bar{x} < c$. Since $\Pr(\bar{X} = c)$ is equal to zero under both the
hypothesis and the alternative, the definition of the test for points having
$\bar{x} = c$ is unimportant. We now choose the value $c$:

\[ \int \phi(x)\, k \exp\left[ -\tfrac12 \sum x_i^2 \right] \prod dx_i
   = \Pr_{\xi=0}\{ \bar{X} > c \}
   = \Pr_{\xi=0}\{ n^{1/2} \bar{X} > n^{1/2} c \}
   = \alpha. \]

Under the hypothesis distribution $n^{1/2} \bar{X}$ has the normal distribution with
mean 0 and variance 1; hence $n^{1/2} c = z_\alpha$, where $z_\alpha$ is the value exceeded
with probability $\alpha$ according to the standardized normal distribution. It
follows then that a size $\alpha$ test having maximum power for the alternative
is given by

\[ \phi(x) = \begin{cases}
     1 & \text{if } \bar{x} > n^{-1/2} z_\alpha, \\
     0 & \text{if } \bar{x} < n^{-1/2} z_\alpha.
   \end{cases} \]

Now, the nice thing about this test is that it in no way depends on the
value $\xi'$, so long as $\xi' > 0$. Our test is thus a uniformly most powerful
test for

Hypothesis: $\xi = 0$,
Alternative: $\xi > 0$.
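The critical value and power of this test are easy to evaluate numerically. The following sketch uses the standard normal distribution from Python's `statistics` module; the function names and the sample values n = 9 and alpha = 0.05 are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standardized normal distribution

def critical_value(n, alpha):
    """Return n**(-1/2) * z_alpha, the cutoff for the sample mean."""
    return Z.inv_cdf(1.0 - alpha) / sqrt(n)

def power(n, alpha, xi):
    """Pr_xi{ Xbar > n**(-1/2) * z_alpha } = Pr{ Z > z_alpha - xi * sqrt(n) }."""
    z_alpha = Z.inv_cdf(1.0 - alpha)
    return 1.0 - Z.cdf(z_alpha - xi * sqrt(n))

n, alpha = 9, 0.05
cutoff = critical_value(n, alpha)
print(cutoff, power(n, alpha, 0.0), power(n, alpha, 0.5), power(n, alpha, 1.0))
```

The power equals alpha at the hypothesis value 0 and increases with the alternative value, while the cutoff itself does not depend on which positive alternative is considered.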
EXAMPLE 3.2. Let $X$ be a random variable with the Poisson distribution
given by

\[ P_m(A) = \int_A e^{-m} \frac{m^x}{x!}\, dN(x), \]

where $N$ gives unit measure to each non-negative integer and zero measure
to the set of all other points, and consider the hypothesis testing problem

Hypothesis: $m = m_0$,
Alternative: $m = m_1$ $(> m_0)$.

By the fundamental lemma the most powerful size-$\alpha$ test is given by

\[ \phi(x) = \begin{cases}
     1 & \text{if } \dfrac{e^{-m_1} m_1^x / x!}{e^{-m_0} m_0^x / x!} > c, \\
     a & \text{if the ratio} = c, \\
     0 & \text{if the ratio} < c.
   \end{cases} \]

Since $m_1 > m_0$, the ratio is an increasing function of $x$, and we have

\[ \phi(x) = \begin{cases}
     1 & \text{if } x > c, \\
     a & \text{if } x = c, \\
     0 & \text{if } x < c,
   \end{cases} \]

where $c$ is chosen to give the test exact size $\alpha$ by satisfying the equation

\[ \Pr_{m_0}\{X > c\} + a \Pr_{m_0}\{X = c\} = \alpha. \]

Figure 10. The test function $\phi(x)$ and probability density function $e^{-m_0} m_0^x / x!$.

The test procedure is to accept the alternative if $x > c$, to accept the
alternative or the hypothesis, respectively, with probabilities $a$ and $1 - a$
when $x = c$, and to accept the hypothesis when $x < c$.

It is for problems such as this involving discrete distributions that the
randomized test functions offer a distinct advantage by allowing the
statistician to use an exact size-$\alpha$ test and thereby increase the power.

The test in this example does not depend on $m_1$ so long as $m_1 > m_0$. It
is therefore a uniformly most powerful size-$\alpha$ test for

Hypothesis: $m = m_0$,
Alternative: $m > m_0$.
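The constants c and a of this randomized test can be found by a short search. The sketch below is one possible way to do it; the function names and the values m0 = 2 and alpha = 0.05 are assumptions chosen for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    return exp(-m) * m**x / factorial(x)

def randomized_poisson_test(m0, alpha):
    """Return (c, a) with Pr{X > c} + a * Pr{X = c} = alpha under mean m0."""
    c = 0
    tail = 1.0 - poisson_pmf(0, m0)        # Pr{X > 0}
    while tail > alpha:
        c += 1
        tail -= poisson_pmf(c, m0)         # now tail = Pr{X > c}
    a = (alpha - tail) / poisson_pmf(c, m0)
    return c, a

def power(c, a, m):
    """Probability of accepting the alternative when the mean is m."""
    upper = 1.0 - sum(poisson_pmf(x, m) for x in range(c + 1))
    return upper + a * poisson_pmf(c, m)

c, a = randomized_poisson_test(2.0, 0.05)
print(c, a, power(c, a, 2.0), power(c, a, 3.0))
```

Without the randomization at x = c, the nonrandomized test "accept the alternative if x > c" would have size Pr{X > c}, which is below alpha here, and correspondingly less power.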
3.3. Composite Hypotheses. If the hypothesis and alternative are
simple, the fundamental lemma in the preceding section gives explicitly the
test function of size $\alpha$ having maximum power. But, if the hypothesis is
composite and the alternative simple, we do not have an analog of that
lemma. However, the lemma, in conjunction with a method inherent in
Wald's work [1] and developed by E. L. Lehmann, can sometimes be used
to obtain size $\alpha$ tests with maximum power.

Consider the hypothesis testing problem

(3.6) Hypothesis: $\theta \in \omega$,
Alternative: $\theta = \theta'$ $(\theta' \notin \omega)$.

The method is to look for a most powerful test not just among tests
satisfying

\[ (3.7) \qquad \int_{\mathcal{X}} \phi(x)\, dP_\theta(x) \le \alpha \]

for each $\theta \in \omega$, but among the larger class of tests satisfying only

\[ (3.8) \qquad \int_{\mathcal{X}} \phi(x)\, dP_{\theta_0}(x) \le \alpha \]

for some particular $\theta_0$. In effect, then, we are considering the simple
hypothesis, simple alternative problem given by

(3.9) Hypothesis: $\theta = \theta_0$,
Alternative: $\theta = \theta'$,

and this can be handled by the fundamental lemma. If we should be very
lucky and find that the most powerful test in the class satisfying (3.8) just
happened to be in the smaller class satisfying (3.7), then it is obviously most
powerful among those in the smaller class, and hence is the most powerful
test for the original problem (3.6). How are we to guide our choice of $\theta_0$?
If we consider the various measures in $\omega$ and look for one that seems to
most resemble the measure of the alternative, $\theta'$, or that would seem to be
most difficult for the statistician to test against or distinguish from the
alternative, then the most powerful test of the simple hypothesis may be of
the correct size for the original problem; i.e., be in the smaller class
satisfying (3.7). Such a choice of $\theta_0$ in $\omega$ is called "least favorable" to the
statistician.


EXAMPLE 3.3. For the probability measures of Example 3.1, consider
the hypothesis testing problem:

(3.10) Hypothesis: $\xi \le 0$,
Alternative: $\xi = \xi_1$ $(\xi_1 > 0)$.

What value of $\xi$ in the hypothesis would be most difficult for the statistician
to distinguish from the alternative value $\xi_1$? The most obvious choice
guided by intuition is to choose the $\xi$ closest to $\xi_1$: i.e., $\xi = 0$. Therefore
we first consider the modified hypothesis testing problem with a simple
hypothesis:

(3.11) Hypothesis: $\xi = 0$,
Alternative: $\xi = \xi_1$.

But from Example 3.1 we know that for (3.11) the most powerful test is

\[ \phi(x) = \begin{cases}
     1 & \text{if } \bar{x} > z_\alpha n^{-1/2}, \\
     0 & \text{if } \bar{x} < z_\alpha n^{-1/2}.
   \end{cases} \]

We must now see if our choice of $\xi$ was truly least favorable; that is, if the
test $\phi(x)$ is in the smaller class of tests that are of the correct size for the
original hypothesis. We evaluate, for $\xi \le 0$,

\[ E_\xi\{\phi(X)\} = \Pr_\xi\{ \bar{X} > z_\alpha n^{-1/2} \}
   = \Pr\{ Z > z_\alpha - \xi n^{1/2} \}
   \le \Pr\{ Z > z_\alpha \}
   = \alpha. \]

$Z$ designates a random variable with the standardized normal distribution,
and the inequality is a consequence of $\xi$ being less than or equal to zero. Our
test is of the correct size, and hence by the argument preceding this example
is the most powerful test for the original problem (3.10).

Since this test does not depend on the value $\xi_1$, we have a most powerful
size-$\alpha$ test for

(3.12) Hypothesis: $\xi \le 0$,
Alternative: $\xi > 0$.

In Section 1, this example was considered with $n = 9$. We have here
the proof that the test examined was a most powerful test in the class $\mathcal{D}_\alpha$
of tests satisfying the size restriction.
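The size check in this example can also be seen numerically: over a grid of hypothesis values, the probability of accepting the alternative never exceeds alpha and is largest at the boundary value 0. The grid, the values n = 9 and alpha = 0.05, and the function name are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power(xi, n, alpha):
    """Pr_xi{ Xbar > z_alpha * n**(-1/2) } for the test of Example 3.1."""
    return 1.0 - Z.cdf(Z.inv_cdf(1.0 - alpha) - xi * sqrt(n))

n, alpha = 9, 0.05
grid = [-k / 10 for k in range(0, 31)]          # xi = 0, -0.1, ..., -3.0
sizes = [power(xi, n, alpha) for xi in grid]
worst = max(sizes)                              # largest error probability on the grid
print(worst, sizes[0], sizes[-1])
```

The maximum is attained at the first grid point, the value 0, which is the least favorable choice used in the example.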


In most problems there will not exist a single $\theta$ in $\omega$ which is least
favorable. However, sometimes a weighted average of hypothesis
measures will be least favorable. Then, instead of examining the tests
that satisfy

\[ (3.13) \qquad P_\phi(\theta) = \int_{\mathcal{X}} \phi(x)\, dP_\theta(x) \le \alpha \]

for all $\theta \in \omega$, we examine the larger class of tests that satisfy the one condition

\[ (3.14) \qquad \int_\omega \left[ \int_{\mathcal{X}} \phi(x)\, dP_\theta(x) \right] d\lambda(\theta) \le \alpha, \]

where $\lambda$ is the probability measure over $\omega$ which gives the "weighted
average." Obviously any $\phi(x)$ satisfying all the conditions of (3.13) will
have an integrand $\le \alpha$ for the final integration in (3.14) and hence will
satisfy (3.14). This justifies the statement that the second class of tests is
"larger" in that it contains the first class.

If we can interchange the order of integration,† the condition (3.14) can
be written

\[ (3.15) \qquad \int_{\mathcal{X}} \phi(x)\, dP_\lambda(x) \le \alpha, \]

where

\[ P_\lambda(A) = \int_\omega P_\theta(A)\, d\lambda(\theta). \]

Or, if the measures in $\omega$ are dominated by a measure $\mu$ and have the
density function $f_\theta(x)$, then the condition (3.14) can be written

\[ (3.16) \qquad \int_{\mathcal{X}} \phi(x) f_\lambda(x)\, d\mu(x) \le \alpha, \]

where

\[ f_\lambda(x) = \int_\omega f_\theta(x)\, d\lambda(\theta). \]

This last step assumes that $f_\theta(x)$ is measurable as a function of $(x, \theta)$ over
$\mathcal{X} \times \omega$. We consider this latter form, using the condition (3.16).

The idea again is to try to choose a weighting $\lambda$ which produces an
average of the hypothesis measures that most resembles the alternative,
which would be most difficult for the statistician to distinguish from the
alternative, which would be least favorable to the statistician. The
procedure is to replace temporarily the original problem (3.6) by the
simple hypothesis modification,

\[ (3.17) \qquad \text{Hypothesis:} \quad P_\lambda(A) = \int_A f_\lambda(x)\, d\mu(x), \qquad
   \text{Alternative:} \quad P_{\theta'}(A) = \int_A f_{\theta'}(x)\, d\mu(x), \]

and find a most powerful test by means of the fundamental lemma. Then,
to see if our choice of $\lambda(\theta)$ was really least favorable, to see if the test is in
the smaller class of tests for the original problem, we must check whether
it satisfies (3.13) for all $\theta \in \omega$. If it is of correct size, then by the same
argument as before it is the most powerful test for the original problem.


nt, and each have the 


EXAMPLE 3.4. Let Xp" Xp be independe 
Consider the hypothesis 


normal distribution with mean € and variance o. 

testing problem: 

(3.18) Hypothesis: o° < 
Alternative: 0“ = 


¿eR 


a 
2 ote 


+ See Robbins [14]. 
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Of course, in considering this problem we really have in mind the problem 
of testing the variance alone, namely; 

Hypothesis: o? < 3, 
(3-19) Alternative: o? > o}. 
However, as a first step we take a particular measure in the alternative of 
(3.19) and consider the simpler problem (3.18). 

The hypothesis is composite. What weighting A(£, o?) of the hypothesis 
measures will most resemble the measure of the alternative? First, with 
ë free, we would naturally think of setting it equal to & and then taking o? 
as large as possible, o? =}. This amounts to putting all A(, o?) 
probability at the one parameter point (&, og). However, if for the then 
modified problem we find the most powerful test, we find that it does not 
satisfy the size condition for the original problem, and hence that this 4 
was not least favorable. 

Let us examine more carefully our choice of probability measure for $\xi$, $\sigma^2$. First, for $\sigma^2$ we naturally want to choose it closest to its alternative value; that is, make it as large as possible, and put all the $\lambda$ probability at $\sigma^2 = \sigma_0^2$. Second, we consider $\xi$. $\xi$ controls the distribution of $\bar{x}$ but has no effect on the remainder of the outcome $x_1 - \bar{x}, \cdots, x_n - \bar{x}$. So it is natural to see how, by weighting $\xi$, we can make the distribution of $\bar{X}$ under the hypothesis most like its distribution under the alternative. Under the hypothesis weighted for $\sigma^2$, $\bar{X}$ has the normal distribution with mean $\xi$ and variance $\sigma_0^2/n$; under the alternative, the normal distribution with mean $\xi_1$ and variance $\sigma_1^2/n$. We describe the distribution of $\bar{X}$ symbolically:

        Hypothesis: $\bar{X} = \xi + Y_0$,
        Alternative: $\bar{X} = \xi_1 + Y_1$,

where $Y_0$, $Y_1$ are normally distributed with means zero and variances, respectively, $\sigma_0^2/n$ and $\sigma_1^2/n$. By giving $\xi$ a normal distribution with mean $\xi_1$, it is easily seen that the marginal distribution of $\bar{X}$ under both the hypothesis and the alternative is normal with mean $\xi_1$. Then by appropriately choosing the variance for $\xi$ we can make the two marginal distributions of $\bar{X}$ also have identical variances. The appropriate variance $\sigma_\xi^2$ is found by equating the marginal variances of $\bar{X}$:

        $\sigma_\xi^2 + \dfrac{\sigma_0^2}{n} = \dfrac{\sigma_1^2}{n}$,

whence $\sigma_\xi^2 = n^{-1}[\sigma_1^2 - \sigma_0^2]$. Our $\lambda(\xi, \sigma^2)$ measure thus chooses $\sigma^2 = \sigma_0^2$ with probability one and gives $\xi$ the normal distribution with mean $\xi_1$ and variance $n^{-1}[\sigma_1^2 - \sigma_0^2]$.
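As a small numerical check on the variance just derived, the following Python sketch computes $\sigma_\xi^2$ and confirms that the weighted hypothesis and the alternative give $\bar{X}$ the same marginal variance. The function names and the sample values are illustrative assumptions and do not come from the text.

```python
def mixing_variance(n, sigma0_sq, sigma1_sq):
    """Variance to give xi so that the marginal variance of the sample mean
    under the weighted hypothesis matches its variance under the alternative."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    if sigma1_sq <= sigma0_sq:
        raise ValueError("the construction requires sigma1_sq > sigma0_sq")
    return (sigma1_sq - sigma0_sq) / n


def marginal_variances(n, sigma0_sq, sigma1_sq):
    """Return (variance under the weighted hypothesis, variance under the alternative)."""
    var_xi = mixing_variance(n, sigma0_sq, sigma1_sq)
    under_hypothesis = var_xi + sigma0_sq / n   # Var(xi) + Var(Y0)
    under_alternative = sigma1_sq / n           # Var(Y1)
    return under_hypothesis, under_alternative
```

For instance, `marginal_variances(10, 1.0, 4.0)` returns two equal values, which is the matching condition used to pick the weighting.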
The original probability density function can be written as a product of the density function for $\bar{x}$ and a factor for the remainder of the outcome:

(3.20)  $f_{\xi,\sigma^2}(x) = k \exp\left[-\dfrac{n}{2\sigma^2}(\bar{x} - \xi)^2\right] \exp\left[-\dfrac{1}{2\sigma^2}\sum (x_i - \bar{x})^2\right]$,

where $k$ depends only on the parameters and is irrelevant for the ratio of probability densities used in the fundamental lemma. Under the $\lambda(\xi, \sigma^2)$ weighting of the hypothesis, $\sigma^2$ is set equal to $\sigma_0^2$, and an integration is performed for the distribution of $\xi$. But $\xi$ occurs only in the $\bar{x}$-density factor of (3.20), and we know from our argument above what marginal density must result for $\bar{X}$ when its conditional density, given $\xi$, is integrated with respect to the normal distribution of $\xi$. Obviously we have

        $f_\lambda(x) = k \exp\left[-\dfrac{n}{2\sigma_1^2}(\bar{x} - \xi_1)^2\right] \exp\left[-\dfrac{1}{2\sigma_0^2}\sum (x_i - \bar{x})^2\right]$.

And, for the alternative, we have

        $f_{\xi_1,\sigma_1^2}(x) = k \exp\left[-\dfrac{n}{2\sigma_1^2}(\bar{x} - \xi_1)^2\right] \exp\left[-\dfrac{1}{2\sigma_1^2}\sum (x_i - \bar{x})^2\right]$.

We apply the fundamental lemma to this problem and obtain

        $\phi(x) = 1$  if  $\dfrac{f_{\xi_1,\sigma_1^2}(x)}{f_\lambda(x)} > c$,
        $\phantom{\phi(x)} = 0$  if  $\dfrac{f_{\xi_1,\sigma_1^2}(x)}{f_\lambda(x)} < c$.

Now each succeeding expression below is a monotone-increasing function of the preceding expression:

        $\dfrac{f_{\xi_1,\sigma_1^2}(x)}{f_\lambda(x)}$,

        $\dfrac{\exp\left[-\frac{1}{2\sigma_1^2}\sum (x_i - \bar{x})^2\right]}{\exp\left[-\frac{1}{2\sigma_0^2}\sum (x_i - \bar{x})^2\right]}$,

        $\sum (x_i - \bar{x})^2$.

Hence the test function can be written

(3.21)  $\phi(x) = 1$  if  $\sum (x_i - \bar{x})^2 > c$,
        $\phantom{\phi(x)} = 0$  if  $\sum (x_i - \bar{x})^2 < c$.


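Under the modified hypothesis, $\sigma^2 = \sigma_0^2$, and hence the induced distribution of $\sum (X_i - \bar{X})^2$ is that of $\sigma_0^2\chi^2$, where $\chi^2$ stands here for a random variable having the $\chi^2$ distribution with $n - 1$ degrees of freedom. Let $\chi_\alpha^2$ be the point exceeded with probability $\alpha$ by $\chi^2$. Then, to give our test exact size $\alpha$, our choice of $c$ is $\sigma_0^2\chi_\alpha^2$, and the test is

        $\phi(x) = 1$  if  $\sum (x_i - \bar{x})^2 > \sigma_0^2\chi_\alpha^2$,
        $\phantom{\phi(x)} = 0$  if  $\sum (x_i - \bar{x})^2 < \sigma_0^2\chi_\alpha^2$.

We now check to see if this test is of size $\alpha$ for the original hypothesis. When the parameter is $(\xi, \sigma^2)$, the induced distribution of $\sum (X_i - \bar{X})^2$ is that of $\sigma^2\chi^2$; therefore

        $E_{\xi,\sigma^2}\{\phi(X)\} = \Pr_{\xi,\sigma^2}\{\sum (X_i - \bar{X})^2 > \sigma_0^2\chi_\alpha^2\}$
        $\phantom{E_{\xi,\sigma^2}\{\phi(X)\}} = \Pr\{\sigma^2\chi^2 > \sigma_0^2\chi_\alpha^2\}$
        $\phantom{E_{\xi,\sigma^2}\{\phi(X)\}} = \Pr\left\{\chi^2 > \dfrac{\sigma_0^2}{\sigma^2}\chi_\alpha^2\right\}$
        $\phantom{E_{\xi,\sigma^2}\{\phi(X)\}} \le \Pr\{\chi^2 > \chi_\alpha^2\}$
        $\phantom{E_{\xi,\sigma^2}\{\phi(X)\}} = \alpha$,

where the inequality results from $\sigma_0^2/\sigma^2 \ge 1$. The test is of correct size and hence is the most powerful size-$\alpha$ test for the original problem (3.18).

The test does not depend on $\xi_1$ or on $\sigma_1^2$ provided $\sigma_1^2 > \sigma_0^2$. Therefore it is a uniformly most powerful test for

        Hypothesis: $\sigma^2 \le \sigma_0^2$, $\xi \in R^1$,
        Alternative: $\sigma^2 > \sigma_0^2$, $\xi \in R^1$,

which is the more general problem (3.19) mentioned at the beginning of the example.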
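The size check above can be reproduced numerically. The sketch below is an illustration under a simplifying assumption: it handles only an even number of degrees of freedom $n - 1$, for which the upper tail of the $\chi^2$ distribution has a closed form as a finite Poisson-type sum. All function names are hypothetical. It evaluates $\Pr\{\chi^2 > (\sigma_0^2/\sigma^2)\chi_\alpha^2\}$, which is the rejection probability of the test at a given $\sigma^2$.

```python
import math


def chi2_upper_tail(x, df):
    """Pr{chi-square with df degrees of freedom > x}, for even df only.

    Uses the identity Pr{chi2_{2m} > x} = exp(-x/2) * sum_{j<m} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch assumes an even, positive df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_upper_point(alpha, df):
    """Point exceeded with probability alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, df) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def rejection_probability(sigma_sq, sigma0_sq, n, alpha):
    """Pr{sum (X_i - Xbar)^2 > sigma0^2 * chi2_alpha} when the variance is sigma_sq."""
    df = n - 1
    crit = chi2_upper_point(alpha, df)
    return chi2_upper_tail(sigma0_sq * crit / sigma_sq, df)
```

With $n = 11$ and $\alpha = 0.05$, `rejection_probability` equals $\alpha$ at $\sigma^2 = \sigma_0^2$, falls below $\alpha$ for smaller $\sigma^2$, and rises above $\alpha$ for larger $\sigma^2$, in line with the inequality chain in the text.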


3.4. The Use of a Sufficient Statistic. In Section 1 we had a general theorem concerning the use of a sufficient statistic. We have now a closely related result concerning its use in hypothesis testing.


THEOREM 3.2. If $\phi(x)$ is a test function for a hypothesis testing problem involving $\{P_\theta \mid \theta \in \Omega\}$, and if $t(x)$ is a sufficient statistic, then $E\{\phi(X) \mid t\}$ is a test function having the same power function as $\phi(x)$.

Proof. The power function of $\phi(x)$ is

        $P_\phi(\theta) = E_\theta^X\{\phi(X)\}$.

But, from the definition of conditional expectation, we have

        $P_\phi(\theta) = E_\theta^T\{E^X\{\phi(X) \mid T\}\}$,

and hence $E\{\phi(X) \mid t\}$ has the same expectation ("power function") as does $\phi(x)$. All we need prove, then, is that $E\{\phi(X) \mid t\}$ is a test function, that is, $E\{\phi(X) \mid t\}$ satisfies

        $0 \le E\{\phi(X) \mid t\} \le 1$

for almost all $t$. But that this occurs follows easily from the representation of conditional expectation as an average with respect to conditional probability. See formula (4.14) in Section 4 of the preceding chapter.
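Theorem 3.2 can be checked exactly on a small discrete problem. The sketch below is an illustration that is not taken from the text: it assumes $n$ independent Bernoulli($p$) observations, for which $T = \sum x_i$ is sufficient, and a hypothetical test that rejects when the first two observations are both 1. Conditioning that test on $T$ gives a new test with the same power function.

```python
from itertools import product


def outcomes(n):
    """All 0/1 outcomes (x_1, ..., x_n)."""
    return list(product((0, 1), repeat=n))


def power(test, n, p):
    """Exact power E_p{test(X)} for independent Bernoulli(p) observations."""
    return sum(test(x) * p ** sum(x) * (1 - p) ** (n - sum(x))
               for x in outcomes(n))


def condition_on_sum(test, n):
    """Return the test E{test(X) | T}, where T = x_1 + ... + x_n.

    Given T = t, every arrangement with t ones is equally likely whatever p is,
    so the conditional expectation is a plain average over those arrangements.
    """
    totals = [0.0] * (n + 1)
    counts = [0] * (n + 1)
    for x in outcomes(n):
        totals[sum(x)] += test(x)
        counts[sum(x)] += 1
    table = [totals[t] / counts[t] for t in range(n + 1)]
    return lambda x: table[sum(x)]


def first_two_are_ones(x):
    """Hypothetical test function: reject when x_1 = x_2 = 1."""
    return 1.0 if x[0] == 1 and x[1] == 1 else 0.0
```

Here `power(first_two_are_ones, n, p)` and `power(condition_on_sum(first_two_are_ones, n), n, p)` agree for every $p$, and the conditioned test takes values between 0 and 1, as the proof requires.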


▶ A generalized definition of sufficient statistic was introduced at the end of Section 5, Chapter 1. Let $\{P_{\theta,\eta} \mid (\theta, \eta) \in \Theta \times H\}$ be a class of probability measures over $\mathcal{X}(\mathcal{A})$. A statistic $t(x)$ is sufficient for $\theta$ if the marginal distribution of $t(X)$ depends only on $\theta$, that is, has the form $\{P_\theta^T \mid \theta \in \Theta\}$, and if the conditional distribution, given $t$, depends only on $\eta$ (the "nuisance" parameter), that is, has the form $\{P_\eta^X(A \mid t) \mid \eta \in H\}$. Consider a hypothesis testing problem involving only $\theta$:

(3.22)  Hypothesis: $\theta \in \omega$, $\eta \in H$,
        Alternative: $\theta \in \Theta - \omega$, $\eta \in H$.


For this we have a generalization of Theorem 3.2. 


THEOREM 3.3. If $\phi(x)$ is a size-$\alpha$ test function for the problem (3.22) and if $t(x)$ is sufficient ($\theta$), then there is a size-$\alpha$ test function $\psi(t)$ for the problem; its power function depends only on $\theta$, and for each $\theta$ has power at least as large as

(3.23)  $\inf_{\eta \in H} P_\phi(\theta, \eta)$,

the minimum power of $\phi(x)$ for that $\theta$.


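Proof. Take any $\eta$, say $\eta_0$, and define

(3.24)  $\psi(t) = E_{\eta_0}\{\phi(X) \mid t\}$.

By the argument in Theorem 3.2 we have $0 \le \psi(t) \le 1$ for almost all $t$, and hence $\psi(t)$ is a test function. The power function of $\psi(t)$ is given by

(3.25)  $P_\psi(\theta, \eta) = E_{\theta,\eta}^X\{\psi(t(X))\} = E_\theta^T\{\psi(T)\}$.

It depends only on $\theta$. Now, using (3.24) we obtain

        $P_\psi(\theta) = E_\theta^T\{\psi(T)\}$
        $\phantom{P_\psi(\theta)} = E_\theta^T\{E_{\eta_0}\{\phi(X) \mid T\}\}$
        $\phantom{P_\psi(\theta)} = E_{\theta,\eta_0}\{\phi(X)\}$
        $\phantom{P_\psi(\theta)} = P_\phi(\theta, \eta_0)$;

and then it easily follows that

(3.26)  $\inf_\eta P_\phi(\theta, \eta) \le P_\psi(\theta) \le \sup_\eta P_\phi(\theta, \eta)$.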




By taking $\theta$ through the values in $\omega$, (3.26) proves that $\psi(t)$ has size $\alpha$. By taking any $\theta$ in $\Theta - \omega$, (3.26) proves (3.23).


A closely related theorem is the following: 


THEOREM 3.4. If $t(x)$ is sufficient ($\theta$) for the class of measures $\{P_{\theta,\eta} \mid (\theta, \eta) \in \Theta \times H\}$, then there is a uniformly most powerful test for the hypothesis testing problem

(3.27)  Hypothesis: $\theta = \theta_0$, $\eta \in H$,
        Alternative: $\theta = \theta_1$, $\eta \in H$;

it can be chosen to have power independent of $\eta$.


Proof. Consider the related problem having a simple alternative:

(3.28)  Hypothesis: $\theta = \theta_0$, $\eta \in H$,
        Alternative: $\theta = \theta_1$, $\eta = \eta_1$.

For this composite hypothesis problem we apply the results in the previous section. For the probability measure $\lambda(\eta)$ over the hypothesis it seems natural to put all probability at $\eta_1$ in order to get a measure most like the alternative. We have then the modified problem:

(3.29)  Hypothesis: $P_{\theta_0,\eta_1}(A) = \int P_{\eta_1}^X(A \mid t)\, dP_{\theta_0}^T(t)$,
        Alternative: $P_{\theta_1,\eta_1}(A) = \int P_{\eta_1}^X(A \mid t)\, dP_{\theta_1}^T(t)$.

Let $\psi(t)$ be the size-$\alpha$ test obtained by applying the fundamental lemma to the hypothesis testing problem on the space $\mathcal{T}$ having hypothesis measure $P_{\theta_0}^T(B)$ and alternative measure $P_{\theta_1}^T(B)$. Let $\phi(x)$ be any size-$\alpha$ test for (3.29). Then it is straightforward to show that the power of $\phi(x)$ cannot exceed the power of $\psi(t(x))$, and hence that $\psi(t(x))$ is the most powerful test for (3.29). Since

        $E_{\theta_0,\eta}\{\psi(t(X))\} = E_{\theta_0}^T\{\psi(T)\} = \alpha$,

it follows that $\psi(t)$ is a size-$\alpha$ test for (3.28) and hence is the most powerful size-$\alpha$ test for (3.28). The test $\psi(t)$ is obviously independent of $\eta_1$. It is therefore a uniformly most powerful test for (3.27). ◀


3.5. Similar Tests. When in the theory of estimation we could not 
obtain a best decision function in the full class available, we placed some 
moderate and reasonable restrictions on the decision functions in the hope 
of finding a best one in the smaller class. We do this now for hypothesis 
testing, and the first restriction we consider is that of similarity. 

A test function $\phi(x)$ is similar of size $\alpha$ for testing the hypothesis $\theta \in \omega$ if

        $P_\phi(\theta) = \int \phi(x)\, dP_\theta(x) = \alpha$

for all $\theta \in \omega$.

For such a test the power function has the constant value $\alpha$ for all values of the parameter in the hypothesis. If the statistician restricts himself to similar tests of size $\alpha$, he is requiring that the test make incorrect decisions at the full allowable rate for all measures in the hypothesis. It is for this reason that similar tests are open to serious criticism. However, there are two things that can be said in their favor. First, for some problems the mathematical form of a similar test can be described quite easily. Second, the theory of similar tests is of use in deriving a best test under the restriction we shall consider in the next section.

If a problem possesses a statistic which for the distributions of the hypothesis is sufficient and boundedly complete, then a similar test function has a very simple form. Under the hypothesis the average or expected value of the test function, given the statistic, must be a constant value $\alpha$ for almost all values of the statistic. The test can then be treated as a conditional test, and be constructed in each subspace of values of $x$ having $t(x) = t$. Its size, given the statistic, must be $\alpha$ for the hypothesis; its power can be maximized for any simple alternative by maximizing the conditional power, given the statistic. The problem is then reduced to one which, for each value of the statistic, can be treated by the fundamental lemma.


THEOREM 3.5. (LEHMANN-SCHEFFÉ). If $t(x)$ is a sufficient and boundedly complete statistic for $\{P_\theta \mid \theta \in \omega\}$, then any similar size-$\alpha$ test $\phi(x)$ has conditional size $\alpha$, given $t$, for almost all $\{P_\theta^T \mid \theta \in \omega\}$ values of $t$; that is,

(3.30)  $E\{\phi(X) \mid t\} = \alpha$

for almost all values of $t$.




If a test satisfies (3.30) it is said to have Neyman structure. 


Proof. Let $\phi(x)$ be a similar size-$\alpha$ test; then

        $E_\theta\{\phi(X)\} = \alpha$  for $\theta \in \omega$,
(3.31)  $E_\theta\{E(\phi(X) \mid T)\} = \alpha$  for $\theta \in \omega$,
        $E_\theta\{E(\phi(X) \mid T) - \alpha\} = 0$  for $\theta \in \omega$.

The conditional expectation does not depend on $\theta$ because $t(x)$ is a sufficient statistic. In (3.31), $E\{\phi(X) \mid t\} - \alpha$ is a function only of $t$, has zero expectation for each $\theta$, and is bounded. Therefore the bounded completeness of $t(x)$ implies that $E\{\phi(X) \mid t\} - \alpha = 0$ for almost all $\{P_\theta^T \mid \theta \in \omega\}$ values of $t$. This is equivalent to (3.30), and proves the theorem.


Let $t(x)$ be a sufficient and boundedly complete statistic for the hypothesis measures $\{P_\theta \mid \theta \in \omega\}$, and consider the hypothesis testing problem

(3.32)  Hypothesis: $\theta \in \omega$,
        Alternative: $\theta = \theta_1$.

We outline the procedure for obtaining a most powerful similar test. We want to examine the test functions $\phi(x)$ that satisfy

(3.33)  $E_\theta\{\phi(X)\} = \alpha$

for all $\theta \in \omega$ and choose one that maximizes

(3.34)  $E_{\theta_1}\{\phi(X)\}$.

But the theorem above says it is equivalent to examine the test functions $\phi(x)$ which satisfy

(3.35)  $E_\omega\{\phi(X) \mid t\} = \alpha$

for almost all ($\omega$) $t$. The subscript $\omega$ indicates that this is the conditional expectation for any $\theta \in \omega$; the conditional expectation may depend on $\theta$ as soon as $\theta$ leaves the set $\omega$. Also the expression to be maximized may be written

        $E_{\theta_1}^T\{E_{\theta_1}\{\phi(X) \mid T\}\}$,

and it is clearly equivalent to maximize

(3.36)  $E_{\theta_1}\{\phi(X) \mid t\}$

for each $t$, provided we do not violate the restrictions on $\phi(x)$. However, to maximize (3.36) fits very neatly with the size restriction (3.35) in conditional form. Our problem has thus been reduced to the following conditional problem. It is to find a test function $\phi(x)$ which for almost all ($\omega$) values of $t$ satisfies

        $E_\omega\{\phi(X) \mid t\} = \alpha$

under the hypothesis and maximizes

        $E_{\theta_1}\{\phi(X) \mid t\}$

for all $t$. Thus the finding of a best similar test over $\mathcal{X}$ is equivalent to finding the best test on the subspace of points having $t(x) = t$, and this is accomplished by applying the fundamental lemma to the conditional measures, given $t$ (assuming the conditional probabilities are measures).


EXAMPLE 3.5. Let $X_1, \cdots, X_n$ be independent and each have the normal distribution with mean $\mu$ and variance $\sigma^2$. Consider the problem of finding a most powerful similar test of size $\alpha$ for

(3.37)  Hypothesis: $\mu = 0$, $\sigma^2 \in\ ]0, \infty[$,
        Alternative: $\mu > 0$, $\sigma^2 \in\ ]0, \infty[$.

$u(x) = (y_1, y_2) = (\bar{x}, \sum (x_i - \bar{x})^2)$ is a sufficient statistic for the class of measures for the problem as a whole. By Theorem 3.2 it suffices to consider tests based on $(y_1, y_2)$, because for any other test there is one based on this statistic that has an identical power function. In terms of the new variables $y_1$, $y_2$, our problem has the following form: $Y_1$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$; $Y_2$ is independent of $Y_1$ and has the $\sigma^2\chi^2$ distribution with $n - 1$ degrees of freedom. The hypothesis and alternative, of course, remain the same.

Under the measures of the hypothesis, the statistic $t(x) = \sum x_i^2$ is a complete sufficient statistic. It follows, then, that $ny_1^2 + y_2 = n\bar{x}^2 + \sum (x_i - \bar{x})^2 = \sum x_i^2$ is a complete sufficient statistic for the hypothesis of the problem in terms of $y_1$ and $y_2$. In order to apply the argument preceding this example, we need to know the form of the conditional distribution of $y_1$ and $y_2$, given the statistic $t(x)$. This we now obtain. For a set $A$ in the space $R^2$ of $y_1$ and $y_2$ ($y_2 > 0$), we have

        $P_{\mu,\sigma^2}(A) = \iint_A \dfrac{n^{1/2}}{(2\pi)^{1/2}\sigma} \exp\left[-\dfrac{n(y_1 - \mu)^2}{2\sigma^2}\right] \times \dfrac{1}{2^{(n-1)/2}\,\Gamma\!\left(\frac{n-1}{2}\right)\sigma^{n-1}}\, y_2^{(n-3)/2} \exp\left[-\dfrac{y_2}{2\sigma^2}\right] dy_1\, dy_2$.

On this integral we make the transformation

        $t = ny_1^2 + y_2$,
        $y_1 = y_1$,

which has the Jacobian equal to 1, and obtain

        $P_{\mu,\sigma^2}(B) = \iint_B \dfrac{n^{1/2}}{(2\pi)^{1/2}\, 2^{(n-1)/2}\,\Gamma\!\left(\frac{n-1}{2}\right)\sigma^{n}} \exp\left(-\dfrac{t}{2\sigma^2} + \dfrac{n\mu y_1}{\sigma^2} - \dfrac{n\mu^2}{2\sigma^2}\right) (t - ny_1^2)^{(n-3)/2}\, dt\, dy_1$.

The joint probability density function for $t$ and $y_1$ is the integrand of the above expression. To obtain the conditional probability density of $y_1$, given $t$, we incorporate into the differential $dt$ a function of $t$ sufficient to make it the differential of the distribution function of $t$. Then by analogy with (4.12) in Chapter 1 the integrand is the conditional probability element we want; it is

(3.38)  $\dfrac{k}{f(t)} \exp\left(-\dfrac{t}{2\sigma^2} + \dfrac{n\mu y_1}{\sigma^2}\right) (t - ny_1^2)^{(n-3)/2}$.

This expression only applies to the space of "possible" values for $y_1$: namely those $y_1$ for which $(t - ny_1^2)$ is non-negative. Elsewhere the probability density is zero. $f(t)$ is the function that was incorporated into the differential element, and $k$ is a constant. $k/f(t)$ can be found directly from this density function by merely requiring that the density integrate to 1 with respect to $y_1$.

Let $f_{\mu,\sigma^2}(y_1 \mid t)$ stand for the conditional density function of $y_1$, given $t$; its functional form is given by (3.38) above. We have in effect the conditional distribution of $y_1$ and $y_2$, given $t$, because, for a given $t$ and $y_1$, the value of $y_2$ is determined by the formula $y_2 = t - ny_1^2$. Hence any conditional test, given $t$, can be based directly on $y_1$.

To solve our problem we first substitute a simple alternative and consider

(3.39)  Hypothesis: $\mu = 0$, $\sigma^2 \in\ ]0, \infty[$,
        Alternative: $\mu = \mu_1$ ($> 0$), $\sigma^2 = \sigma_1^2$,

and then apply the argument preceding this example to find a most powerful similar test. In conditional form we wish to obtain a test function $\phi(y_1 \mid t)$ which has size $\alpha$ with respect to the density $f_{0,\sigma^2}(y_1 \mid t)$ and maximum power with respect to $f_{\mu_1,\sigma_1^2}(y_1 \mid t)$. By the fundamental lemma, we obtain

        $\phi(y_1 \mid t) = 1$  if  $\dfrac{f_{\mu_1,\sigma_1^2}(y_1 \mid t)}{f_{0,\sigma^2}(y_1 \mid t)} > c(t)$,
        $\phantom{\phi(y_1 \mid t)} = 0$  if  $\dfrac{f_{\mu_1,\sigma_1^2}(y_1 \mid t)}{f_{0,\sigma^2}(y_1 \mid t)} < c(t)$,

where, of course, the denominator of the probability ratio does not actually depend on $\sigma^2$ because $\mu = 0$ and $t$ is a sufficient statistic for the hypothesis. Now each succeeding expression below when viewed in terms of $y_1$ is a monotone-increasing function of the preceding expression:

        $\dfrac{f_{\mu_1,\sigma_1^2}(y_1 \mid t)}{f_{0,\sigma^2}(y_1 \mid t)}$,

        $\dfrac{\exp\left(-\frac{t}{2\sigma_1^2} + \frac{n\mu_1 y_1}{\sigma_1^2}\right)(t - ny_1^2)^{(n-3)/2}}{\exp\left(-\frac{t}{2\sigma^2}\right)(t - ny_1^2)^{(n-3)/2}}$,

        $\dfrac{n\mu_1 y_1}{\sigma_1^2}$,

        $y_1$.

Therefore the test can be written

        $\phi(y_1 \mid t) = 1$  if  $y_1 > c(t)$,
        $\phantom{\phi(y_1 \mid t)} = 0$  if  $y_1 < c(t)$,

where $c(t)$ is chosen to satisfy the size condition

(3.40)  $\int_{c(t)}^{\infty} f_{0,\sigma^2}(y_1 \mid t)\, dy_1 = \alpha$.

But $f_{0,\sigma^2}(y_1 \mid t)$ has the form

        $k(t)(t - ny_1^2)^{(n-3)/2}$

in the range in which $t - ny_1^2$ is non-negative and is zero elsewhere. From this it is easily seen that the distribution of $z = (n/t)^{1/2}y_1$ has the conditional density function of the form

        $h(t)(1 - z^2)^{(n-3)/2}$.

Since this must integrate to 1, $h(t)$ is a constant and $z = (n/t)^{1/2}y_1$ has a conditional distribution independent of $t$. Let $b_\alpha$ be the point exceeded with probability $\alpha$ according to this distribution. Then $b_\alpha = (n/t)^{1/2}c(t)$, and the test can be written

        $\phi(y_1 \mid t) = 1$  if  $y_1 > (t/n)^{1/2}b_\alpha$,
        $\phantom{\phi(y_1 \mid t)} = 0$  if  $y_1 < (t/n)^{1/2}b_\alpha$.


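The following relations on $y_1$ and $t$ and on $y_1$ and $y_2$ are equivalent:

        $y_1 > (t/n)^{1/2}b_\alpha$,

        $y_1 > \dfrac{(ny_1^2 + y_2)^{1/2}}{n^{1/2}}\, b_\alpha$,

        $\dfrac{n^{1/2}\, y_1/y_2^{1/2}}{\left[1 + n\, y_1^2/y_2\right]^{1/2}} > b_\alpha$,

        $\dfrac{y_1}{y_2^{1/2}} > c_\alpha$.

The last step follows from the monotonicity of the function $a(1 + a^2)^{-1/2}$. $c_\alpha$ is a constant and depends on $b_\alpha$ and $n$. From these relations it follows that the test can be written

        $\phi(y_1 \mid t) = 1$  if  $\dfrac{y_1}{y_2^{1/2}} > c_\alpha$,
        $\phantom{\phi(y_1 \mid t)} = 0$  if  $\dfrac{y_1}{y_2^{1/2}} < c_\alpha$,

or in terms of the original variables

        $\phi(x) = 1$  if  $\dfrac{\bar{x}}{\left[\sum (x_i - \bar{x})^2\right]^{1/2}} > c_\alpha$,
        $\phantom{\phi(x)} = 0$  if  $\dfrac{\bar{x}}{\left[\sum (x_i - \bar{x})^2\right]^{1/2}} < c_\alpha$,

or equivalently

(3.41)  $\phi(x) = 1$  if  $\dfrac{n^{1/2}\bar{x}}{\left[(n-1)^{-1}\sum (x_i - \bar{x})^2\right]^{1/2}} > d_\alpha$,
        $\phantom{\phi(x)} = 0$  if  $\dfrac{n^{1/2}\bar{x}}{\left[(n-1)^{-1}\sum (x_i - \bar{x})^2\right]^{1/2}} < d_\alpha$.

But this is just the ordinary Student's $t$ test. Since it does not depend on $\mu_1$ ($> 0$) or on $\sigma_1^2$, it is a uniformly most powerful similar size-$\alpha$ test for

        Hypothesis: $\mu = 0$, $\sigma^2 \in\ ]0, \infty[$,
        Alternative: $\mu > 0$, $\sigma^2 \in\ ]0, \infty[$.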
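The chain of equivalent relations in Example 3.5 can be checked on data. The Python sketch below uses hypothetical function names and relies on an identity that follows from the definitions but is not written out in the text: with $z = (n/t)^{1/2}y_1$, the Student statistic of (3.41) equals $(n-1)^{1/2} z (1 - z^2)^{-1/2}$, an increasing function of $z$. It assumes the observations are not all equal, so that $|z| < 1$.

```python
import math


def z_statistic(x):
    """z = (n/t)^(1/2) * y1, with y1 the sample mean and t the sum of squares."""
    n = len(x)
    y1 = sum(x) / n
    t = sum(v * v for v in x)
    return math.sqrt(n / t) * y1


def student_t(x):
    """Ordinary Student statistic n^(1/2) * mean / s, as in (3.41)."""
    n = len(x)
    mean = sum(x) / n
    s_sq = sum((v - mean) ** 2 for v in x) / (n - 1)
    return math.sqrt(n) * mean / math.sqrt(s_sq)


def t_from_z(z, n):
    """Increasing map from z in ]-1, 1[ to the Student statistic."""
    return math.sqrt(n - 1) * z / math.sqrt(1.0 - z * z)
```

Because `t_from_z` is increasing, rejecting for large $z$ with a cut-off that does not depend on $t$ is the same as rejecting for large values of the Student statistic. The statistic $z$ is also unchanged when every observation is multiplied by a positive constant, which is why its conditional distribution under the hypothesis does not involve $\sigma^2$.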
3.6. Unbiased Tests. As a second restriction on the class of test functions we consider unbiasedness.

A test $\phi(x)$ of the hypothesis: $\theta \in \omega$ against the alternative: $\theta \in \Omega - \omega$ is unbiased of size $\alpha$ if

        $E_\theta\{\phi(X)\} \le \alpha$

for $\theta \in \omega$ and

        $E_\theta\{\phi(X)\} \ge \alpha$

for $\theta \in \Omega - \omega$.

In restricting ourselves to unbiased tests we are requiring that a test should accept the alternative more frequently when to accept is the correct decision than when it is incorrect. Unlike similarity, which was primarily a device for obtaining tests, unbiasedness is a very reasonable property for the practicing statistician to require of his test.

The condition of unbiasedness, being based on inequalities, is not as easy to handle mathematically as similarity. However, in some problems it is possible to make use of our theory on similar tests to obtain a most powerful unbiased test.


THEOREM 3.6. (LEHMANN). If $\Lambda$ is the common boundary of $\omega$ and $\Omega - \omega$ and if the power $P_\phi(\theta)$ is a continuous function of $\theta$ for any test $\phi$, then an unbiased size-$\alpha$ test of $\omega$ against $\Omega - \omega$ is similar of size $\alpha$ for the measures of $\Lambda$.

The theorem assumes a topology on $\Omega$ with respect to which the continuity and common boundary are defined.

Proof. The common boundary is the set of points that are limit points both of sequences in $\omega$ and in $\Omega - \omega$. Since $P_\phi(\theta) \le \alpha$ for $\theta \in \omega$, $P_\phi(\theta) \le \alpha$ for $\theta \in \Lambda$ by the continuity. Similarly, since $P_\phi(\theta) \ge \alpha$ for $\theta \in \Omega - \omega$, then $P_\phi(\theta) \ge \alpha$ for $\theta \in \Lambda$. However, the two inequalities give $P_\phi(\theta) = \alpha$ for $\theta \in \Lambda$.

By this theorem the class of unbiased tests of $\omega$ against $\Omega - \omega$ is contained in the class of tests similar on $\Lambda$. If we can find a most powerful similar test of $\Lambda$ against $\Omega - \omega$, and if this test is unbiased, of size $\alpha$, then necessarily it is the most powerful test in the smaller class of unbiased tests.


EXAMPLE 3.6. Let $X_1, \cdots, X_n$ be independent, and assume that each $X_i$ has the normal distribution with mean $\mu$ and variance $\sigma^2$. Consider the problem of finding a most powerful unbiased test for

(3.42)  Hypothesis: $\mu \le 0$, $\sigma^2 \in\ ]0, \infty[$,
        Alternative: $\mu > 0$, $\sigma^2 \in\ ]0, \infty[$.
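The power function for any test $\phi(x)$ is given by

        $P_\phi(\mu, \sigma^2) = \int \phi(x)\,(2\pi\sigma^2)^{-n/2} \exp\left[-\dfrac{1}{2\sigma^2}\sum (x_i - \mu)^2\right] \prod dx_i$.

Since the integrand is a continuous function of $(\mu, \sigma^2)$ and since in any neighborhood of a value of $(\mu, \sigma^2)$ it is bounded by an integrable function, it follows that the power is a continuous function of $(\mu, \sigma^2)$. The common boundary for $\omega$ and $\Omega - \omega$ is given by $\Lambda = \{(\mu, \sigma^2) \mid \mu = 0,\ \sigma^2 \in\ ]0, \infty[\}$. By the argument above we look for a most powerful similar test of

        $\Lambda$: $\mu = 0$, $\sigma \in\ ]0, \infty[$,
        Alt: $\mu > 0$, $\sigma \in\ ]0, \infty[$.

But by Example 3.5 the most powerful similar test is the ordinary $t$ test.

Figure 11. The parameter space $\Omega$ in Example 3.6 (axes $\mu$ and $\sigma^2$).

We now check to see if the ordinary $t$ test is unbiased. Suppose that $t_\alpha$ has been chosen so that

        $\Pr_{0,\sigma^2}\left\{\dfrac{n^{1/2}\bar{X}}{s_X} > t_\alpha\right\} = \alpha$

when $\mu = 0$. Consider a value of $\mu > 0$; then

        $E_{\mu,\sigma^2}\{\phi(X)\} = \Pr_{\mu,\sigma^2}\left\{\dfrac{n^{1/2}\bar{X}}{s_X} > t_\alpha\right\}$
        $\phantom{E_{\mu,\sigma^2}\{\phi(X)\}} = \Pr_{\mu,\sigma^2}\{n^{1/2}\bar{X} > t_\alpha s_X\}$
        $\phantom{E_{\mu,\sigma^2}\{\phi(X)\}} \ge \Pr_{\mu,\sigma^2}\{n^{1/2}\bar{X} - n^{1/2}\mu > t_\alpha s_X\}$
        $\phantom{E_{\mu,\sigma^2}\{\phi(X)\}} = \Pr_{\mu,\sigma^2}\left\{\dfrac{n^{1/2}(\bar{X} - \mu)}{s_X} > t_\alpha\right\}$
        $\phantom{E_{\mu,\sigma^2}\{\phi(X)\}} = \alpha$.

Similarly, if $\mu < 0$, we obtain

        $E_{\mu,\sigma^2}\{\phi(X)\} \le \alpha$.

These two results establish that $\phi(x)$ is unbiased of size $\alpha$, and hence is the most powerful unbiased test for the original problem (3.42).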
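The unbiasedness just established can be illustrated by simulation. The following Python sketch is only a Monte Carlo illustration with assumed settings that are not from the text: $n = 10$, $\alpha = 0.05$, and the tabled value 1.833 for the upper 5 per cent point of Student's distribution with 9 degrees of freedom. The function names, the number of replications, and the seed are arbitrary choices.

```python
import math
import random


def student_t(x):
    """Student statistic n^(1/2) * mean / s for a sample x."""
    n = len(x)
    mean = sum(x) / n
    s_sq = sum((v - mean) ** 2 for v in x) / (n - 1)
    return math.sqrt(n) * mean / math.sqrt(s_sq)


def estimated_power(mu, sigma, n, t_alpha, reps=20000, seed=0):
    """Monte Carlo estimate of Pr{t statistic > t_alpha} for normal(mu, sigma^2) data."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if student_t(sample) > t_alpha:
            rejections += 1
    return rejections / reps


# Tabled upper 5% point of Student's t with 9 degrees of freedom (assumed setting).
T_ALPHA = 1.833
```

Estimates such as `estimated_power(0.0, 2.0, 10, T_ALPHA)` should lie close to 0.05 whatever $\sigma$ is, while `estimated_power(-0.5, 1.0, 10, T_ALPHA)` falls below 0.05 and `estimated_power(0.5, 1.0, 10, T_ALPHA)` rises above it.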


3.7. Invariant Tests. In this section a third restriction on the class of 
tests is considered—the restriction to invariant tests. The general ideas of 
the invariance method were introduced in Section 2.3, and we only 
briefly outline them here. 

For many problems there are transformations that can be applied to the 
outcome and that produce a transformed problem statistically the same as 
the original problem. The invariance restriction is then to consider only 
those decision or test functions that have the same values for the trans- 
formed outcomes as for the corresponding original outcomes; such tests 
are called invariant tests. It is certainly a reasonable restriction. For, if 
the problem is not altered by the transformations, then why should the 
result of applying a test function be altered? 

Let the class of probability measures be $\{P_\theta \mid \theta \in \Omega\}$ over the space $\mathcal{X}(\mathcal{A})$. Then a class $\mathcal{G}$ of measurable transformations $sx$ from $\mathcal{X}$ into $\mathcal{X}$ is called invariant for the probability structure if it satisfies

(1) $\mathcal{G}$ is a group.

(2) The class of measures $\{P_\theta \mid \theta \in \Omega\}$ is closed under $\mathcal{G}$; that is, if $X$ has the measure $P_\theta$ ($\theta \in \Omega$), then $sX$ for $s \in \mathcal{G}$ has the probability measure $P_{\bar{s}\theta}$ where $\bar{s}\theta \in \Omega$.

The class $\bar{\mathcal{G}}$ of transformations $\bar{s}$ on $\Omega$ forms a group homomorphic to $\mathcal{G}$ (Problems 18 and 19).

Consider now the hypothesis testing problem:

(3.43)  Hypothesis: $\theta \in \omega$,
        Alternative: $\theta \in \Omega - \omega$.

If the transformations in $\mathcal{G}$ do not alter this hypothesis testing problem, then the measures of the hypothesis must remain measures of the hypothesis, and the measures of the alternative must remain measures of the alternative. We summarize this in

(3) If $\theta \in \omega$, then $\bar{s}\theta \in \omega$; if $\theta \in \Omega - \omega$, then $\bar{s}\theta \in \Omega - \omega$.

The invariance principle for hypothesis testing is to confine attention to the invariant test functions:

$\phi(x)$ is an invariant test function re $\mathcal{G}$ if

        $\phi(sx) = \phi(x)$

for all $s \in \mathcal{G}$ and $x \in \mathcal{X}$.

In some problems a weaker form of invariance is useful:

$\phi(x)$ is almost invariant for $\mathcal{G}$ if, for each $s \in \mathcal{G}$,

        $\phi(sx) = \phi(x)$

for almost all $\{P_\theta\}$ $x$.

In order to apply the invariance method we need some way of describing the invariant test function. This can be done quite simply by the use of a maximal invariant function. For any statistic $t(x)$ there is a corresponding partition of $\mathcal{X}$; this was described at the end of Section 1 in Chapter 1. If the statistic is invariant, we call the partition invariant. An invariant partition has, of course, the property that $x$ and $sx$ are always in the same set of the partition. If one partition of $\mathcal{X}$ is formed by the sets $\{A\}$ and another partition is formed by the sets $\{B\}$, then the totality of sets $A \cap B$ also forms a partition of $\mathcal{X}$, the intersection partition. Similarly, any class of partitions will produce an intersection partition. However, if the original partitions were induced by statistics, that is, measurable functions, the question naturally arises of the measurability of a function inducing the intersection partition. But we can always make it measurable by appropriately defining the $\sigma$-algebra on the range of values of the function. We take the natural $\sigma$-algebra as given by formula (2.2) in Chapter 1. It follows very easily that, if the original partitions were invariant, then the intersection partition is also invariant (see Problem 34). The maximal invariant partition is the intersection partition of all invariant partitions. From its definition the maximal invariant partition is the finest invariant partition in the sense that no set of the partition can have a proper subset belonging to an invariant partition. Sometimes it is convenient to think of a maximal invariant function, say $m(x)$. This is any function whose partition is the maximal invariant partition. The values of the function $m(x)$ are unimportant except that $m(x) = m(x')$ if $x$ and $x'$ are in the same set of the maximal invariant partition and $m(x) \ne m(x')$ if $x$ and $x'$ are in different sets. As mentioned above we make $m(x)$ measurable by choosing the natural $\sigma$-algebra on the space of values of $m(x)$.

Another definition of the maximal invariant partition is also of interest. With any point $x \in \mathcal{X}$ we associate a set containing it defined by

        $T_x = \{x' \mid x' = sx,\ s \in \mathcal{G}\}$;

that is, all points obtained from $x$ by transformations in $\mathcal{G}$. If we took any point $x^* = s^*x$ in $T_x$ and considered $T_{x^*}$, we would have, since $\mathcal{G}$ is a group,

        $T_{x^*} = T_x$.

Thus the sets $T_x$ form a partition of $\mathcal{X}$. The sets $T_x$ are obviously closed under the transformations $s$, and certainly no proper subset of any $T_x$ would be closed. Therefore this is the maximal invariant partition.


THEOREM 3.7. Any invariant statistic $\phi(x)$ can be expressed as a measurable function of the maximal invariant function $m(x)$.

Figure 12. Typical sets of an invariant partition.

Proof. A function $\phi(x)$ can be written in terms of another function $m(x)$ if, whenever for a set of values of $x$ $m(x)$ is constant, then so also is $\phi(x)$. This is easily seen because $\phi(x)$ then has a unique value for all $x$ giving rise to a value for $m(x)$. In terms of partitions this is equivalent to the $m(x)$ partition being a subpartition of the $\phi(x)$ partition. But the maximal invariant partition is a subpartition of any invariant partition; therefore we can write $\phi(x) = f(m(x))$.

To complete the proof we need only show that $f(m)$ is a measurable function. Let $B$ be any measurable set in the range of $\phi$ or equivalently of $f$. We want to prove that $f^{-1}(B)$ is a measurable set. But the measurable sets $M$ in the range of $m$ are those for which $m^{-1}(M) \in \mathcal{A}$. Therefore we want to prove that $m^{-1}f^{-1}(B)$ is measurable; that is, $\in \mathcal{A}$. But $m^{-1}f^{-1}(B) = (fm)^{-1}B = \phi^{-1}(B)$. Since $\phi$ is measurable, $\phi^{-1}(B) \in \mathcal{A}$; that is, $m^{-1}f^{-1}(B) \in \mathcal{A}$. This completes the proof.

By our Theorem 3.7 any invariant function is equivalent to a function of the maximal invariant function. Thus, in a hypothesis testing problem, if we wish to restrict our attention to invariant test functions, we equivalently examine the test functions based on the maximal invariant. We have, then, the following invariance method for treating a hypothesis testing problem. Find a group $\mathcal{G}$ of transformations which is invariant for the problem; for this group find a maximal invariant function; calculate the induced measures for the maximal invariant function, and consider the hypothesis testing problem for these; then look for a best test for this related problem. The resulting test expressed in terms of the original outcome by means of the maximal invariant will be the best invariant test.

For the transformations $\bar{\mathcal{G}}$ on $\Omega$, we can define a maximal invariant partition. Let $\bar{m}(\theta)$ stand for the corresponding maximal invariant function. It would be natural to suspect that, for the reduced problem in terms of $m(x)$, the probability measures can be expressed in terms of the maximal invariant $\bar{m}(\theta)$ as the parameter. For this we have the theorem:


THEOREM 3.8. If $\phi(x)$ is invariant re $\mathcal{G}$, then the probability measure for $\phi(X)$ is constant over each set of the maximal invariant ($\bar{\mathcal{G}}$) partition of $\Omega$; that is, the distributions for $\phi$ depend on $\theta$ through $\bar{m}(\theta)$.


Proof. We wish to show that

        $\Pr_\theta\{\phi(X) \in B\} = \Pr_{\theta'}\{\phi(X) \in B\}$

for all $B \in \mathcal{B}$ whenever $\theta$ and $\theta'$ belong to the same set of the maximal invariant partition on $\Omega$; that is, whenever $\theta' = \bar{s}\theta$. We have

        $\Pr_{\theta'}\{\phi(X) \in B\} = \Pr_{\bar{s}\theta}\{\phi(X) \in B\}$
        $\phantom{\Pr_{\theta'}\{\phi(X) \in B\}} = \Pr_\theta\{\phi(sX) \in B\}$
        $\phantom{\Pr_{\theta'}\{\phi(X) \in B\}} = \Pr_\theta\{\phi(X) \in B\}$,

where the last step follows from the invariance property of $\phi(x)$; viz., $\phi(sx) = \phi(x)$ for all $s$.


EXAMPLE 3.7. Let $X_1, \cdots, X_n$ be independent, and assume that each $X_i$ has the normal distribution with mean $\mu$ and variance $\sigma^2$. Consider the hypothesis testing problem

(3.44)  Hypothesis: $\mu \le 0$, $\sigma^2 \in\ ]0, \infty[$,
        Alternative: $\mu > 0$, $\sigma^2 \in\ ]0, \infty[$.

By Theorem 3.2 we can restrict attention to tests based on the sufficient statistic $(u, v^2) = (\bar{x}, \sum (x_i - \bar{x})^2)$.

Consider the class of transformations $\mathcal{G}$,

        $\mathcal{G} = \{x_i' = cx_i\ (i = 1, \cdots, n) \mid c \in\ ]0, \infty[\}$.

These transformations obviously form a group. Because each transformation is just a change of scale about the origin, a normal distribution remains a normal distribution and the new mean is $c\mu$, and the new variance is $c^2\sigma^2$. Therefore the class of induced transformations on $\Omega$ is

        $\bar{\mathcal{G}} = \{\mu' = c\mu,\ \sigma'^2 = c^2\sigma^2 \mid c \in\ ]0, \infty[\}$.

Now, since the hypothesis and alternative are transformed by elements of $\bar{\mathcal{G}}$ into themselves, respectively, the class $\mathcal{G}$ is an invariant class of transformations for the problem.

Figure 13. The partition for $t = u/v$.

The induced class of transformations on the sufficient statistic is

(3.45)  $\mathcal{G}^* = \{u' = cu,\ v'^2 = c^2v^2 \mid c \in\ ]0, \infty[\}$.

We now show that $t = u/v$ is a maximal invariant function. In Fig. 13, two typical sets $t$, $t'$ of the partition for $t = u/v$ are exhibited. This partition is invariant. For the transformed value of $t$, $t' = cu/cv$, is equal to the untransformed value $t = u/v$ for all transformations and all points $(u, v)$. To prove the partition is the maximal invariant partition, we show that there is no subpartition that is invariant. Consider a set of the partition, say the one indexed by $t'$. Obviously a point in the set can be transformed into any other by a suitable value of $c$; hence there is no proper subset left unchanged by the elements of $\mathcal{G}^*$. Thus the partition does not have an invariant subpartition and hence is maximal invariant.
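The two halves of this argument, invariance of $t = u/v$ and the fact that a single change of scale connects any two points with the same value of $t$, can be checked numerically. The Python sketch below uses hypothetical function names and arbitrary sample values.

```python
import math


def sufficient_statistic(x):
    """Return (u, v): u is the sample mean, v^2 the sum of squared deviations."""
    n = len(x)
    u = sum(x) / n
    v = math.sqrt(sum((xi - u) ** 2 for xi in x))
    return u, v


def maximal_invariant(x):
    """t = u / v, unchanged by any change of scale x_i -> c * x_i with c > 0."""
    u, v = sufficient_statistic(x)
    return u / v


def scale(x, c):
    """Apply the transformation x_i' = c * x_i of the group considered here."""
    if c <= 0:
        raise ValueError("the group only contains scale changes with c > 0")
    return [c * xi for xi in x]


def connecting_scale(point, other):
    """The c > 0 carrying (u, v) to (u', v') when u / v = u' / v'."""
    return other[1] / point[1]
```

Applying `scale` to a sample leaves `maximal_invariant` unchanged, and for two points $(u, v)$ and $(u', v')$ on the same set of the partition, `connecting_scale` returns the $c$ with $u' = cu$ and $v' = cv$.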


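To state the problem in terms of the maximal invariant, we need to derive the induced probability distribution of $t$. It is convenient to use the equivalent statistic $t^* = n^{1/2}t$. Now,

(3.46)  $t^* = \dfrac{n^{1/2}u}{v} = \dfrac{n^{1/2}u/\sigma}{v/\sigma}$.

We can represent the induced distribution of $t^*$ by

        $t^* = \dfrac{Z}{\chi}$,

where the distribution of $Z$ is normal with mean $n^{1/2}\mu/\sigma$ and variance 1, and the distribution of $\chi^2$ is the $\chi^2$-distribution with $n - 1$ degrees of freedom. If we write $n^{1/2}\mu/\sigma = \delta$, then the joint probability density function for $Z$ and $\chi$ can be written

        $(2\pi)^{-1/2} \exp\left[-\tfrac{1}{2}(z - \delta)^2\right] \left[2^{(n-3)/2}\,\Gamma\!\left(\tfrac{n-1}{2}\right)\right]^{-1} \chi^{n-2} \exp\left(-\dfrac{\chi^2}{2}\right)$.

By expanding the exponential, we obtain

        $(2\pi)^{-1/2} \exp\left(-\dfrac{\delta^2}{2}\right) \sum_{r=0}^{\infty} \dfrac{(z\delta)^r}{r!} \exp\left(-\dfrac{z^2}{2}\right) \times \left[2^{(n-3)/2}\,\Gamma\!\left(\tfrac{n-1}{2}\right)\right]^{-1} \chi^{n-2} \exp\left(-\dfrac{\chi^2}{2}\right)$.

The transformation ($t^* = z/\chi$, $\chi = \chi$) has the Jacobian $\chi^{-1}$; therefore the density function for $(t^*, \chi)$ is

        $(2\pi)^{-1/2} \exp\left(-\dfrac{\delta^2}{2}\right) \sum_{r=0}^{\infty} \dfrac{(t^*\chi\delta)^r}{r!} \exp\left(-\dfrac{t^{*2}\chi^2}{2}\right) \times \left[2^{(n-3)/2}\,\Gamma\!\left(\tfrac{n-1}{2}\right)\right]^{-1} \chi^{n-1} \exp\left(-\dfrac{\chi^2}{2}\right)$.

Now, integrating with respect to $\chi$ to obtain the marginal probability density for $t^*$, we have

        $(2\pi)^{-1/2} \exp\left(-\dfrac{\delta^2}{2}\right) \left[2^{(n-3)/2}\,\Gamma\!\left(\tfrac{n-1}{2}\right)\right]^{-1} \sum_{r=0}^{\infty} \dfrac{(t^*\delta)^r}{r!} \int_0^\infty \chi^{n+r-1} \exp\left[-\dfrac{\chi^2}{2}(1 + t^{*2})\right] d\chi$,

where

        $\int_0^\infty \chi^{n+r-1} \exp\left[-\dfrac{\chi^2}{2}(1 + t^{*2})\right] d\chi = 2^{(n+r-2)/2}(1 + t^{*2})^{-(n+r)/2} \int_0^\infty w^{(n+r-2)/2} e^{-w}\, dw$
        $\phantom{\int_0^\infty} = 2^{(n+r-2)/2}(1 + t^{*2})^{-(n+r)/2}\, \Gamma\!\left(\dfrac{n+r}{2}\right)$.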
2 




Therefore the induced probability density function for $t^*$ is

(3.47)  $\dfrac{\exp(-\delta^2/2)}{\pi^{1/2}\,\Gamma\!\left(\frac{n-1}{2}\right)} \sum_{r=0}^{\infty} \dfrac{(2^{1/2}t^*\delta)^r}{r!}\, \dfrac{\Gamma\!\left(\frac{n+r}{2}\right)}{(1 + t^{*2})^{(n+r)/2}}$,

and, if $\delta = 0$, is

(3.48)  $\dfrac{\Gamma\!\left(\frac{n}{2}\right)}{\pi^{1/2}\,\Gamma\!\left(\frac{n-1}{2}\right)}\, \dfrac{1}{(1 + t^{*2})^{n/2}}$.

This last expression is the density function for $(n - 1)^{-1/2}$ times a random variable with the Student distribution.

The distribution of $t^*$ depends only on $\delta$ and $n$. $n$ has a given value for this problem while $\delta$ is a parameter which indexes the different distributions of $t^*$. In agreement with Theorem 3.8 we find that the parameter $\delta$ is constant over each set of the maximal invariant partition; in fact $\delta = n^{1/2}\mu/\sigma$, and $\mu/\sigma$ is obviously a maximal invariant function.
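The series (3.47) is easy to evaluate numerically, which gives a check on the derivation. The Python sketch below is an illustration with hypothetical function names. It truncates the series at a fixed number of terms, which is an assumption that is adequate only for moderate $\delta$ and $n$, and it integrates over the real line with a midpoint rule after the substitution $t^* = \tan\theta$.

```python
import math


def density_t_star(t, delta, n, terms=80):
    """Truncated series (3.47) for the density of t* (n - 1 degrees of freedom)."""
    a = math.sqrt(2.0) * delta * t / math.sqrt(1.0 + t * t)
    series = sum(math.gamma((n + r) / 2.0) * a ** r / math.factorial(r)
                 for r in range(terms))
    front = math.exp(-delta * delta / 2.0) / (
        math.sqrt(math.pi) * math.gamma((n - 1) / 2.0))
    return front * series * (1.0 + t * t) ** (-n / 2.0)


def central_density(t, n):
    """Formula (3.48), the case delta = 0."""
    return (math.gamma(n / 2.0)
            / (math.sqrt(math.pi) * math.gamma((n - 1) / 2.0))
            * (1.0 + t * t) ** (-n / 2.0))


def total_mass(delta, n, steps=2000):
    """Midpoint-rule integral of the density over the real line, with t = tan(theta)."""
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = -math.pi / 2.0 + (i + 0.5) * h
        t = math.tan(theta)
        # dt = (1 + t^2) dtheta under the substitution t = tan(theta)
        total += density_t_star(t, delta, n) * (1.0 + t * t) * h
    return total
```

With $\delta = 0$ the series reduces to (3.48); for $\delta = 1$ and $n = 6$ the total mass comes out as 1 to within numerical error; and the ratio of the density at $\delta = 1$ to the density at $\delta = 0$ increases with $t^*$.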

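The problem for the induced distributions of $t^*$ is

(3.49)    Hypothesis: $\delta \le 0$,
          Alternative: $\delta > 0$.

We consider first a simple alternative $\delta = \delta_1$. A least favorable distribution for the parameter of the hypothesis would seemingly be to apply all probability to the single value $\delta = 0$. We apply the fundamental lemma to the simplified problem,

(3.50)    Hypothesis: $\delta = 0$,
          Alternative: $\delta = \delta_1$,

and obtain, as the most powerful test function,

\[
\phi(t^*) = 1 \quad \text{if} \quad
\frac{\left[\pi^{1/2}\,\Gamma\left(\frac{n-1}{2}\right)\right]^{-1} \exp\left(-\frac{\delta_1^2}{2}\right) \sum_{r=0}^{\infty} \frac{(2^{1/2} t^* \delta_1)^r}{r!} \, \frac{\Gamma\left(\frac{n+r}{2}\right)}{(1 + t^{*2})^{(n+r)/2}}}
{\left[\pi^{1/2}\,\Gamma\left(\frac{n-1}{2}\right)\right]^{-1} \Gamma\left(\frac{n}{2}\right) (1 + t^{*2})^{-n/2}} > c,
\qquad \phi(t^*) = 0 \quad \text{if the ratio is} < c;
\]

that is,

\[
\phi(t^*) = 1 \quad \text{if} \quad \sum_{r=0}^{\infty} \frac{\delta_1^r}{r!}\, \Gamma\left(\frac{n+r}{2}\right) 2^{r/2} \left[\frac{t^*}{(1 + t^{*2})^{1/2}}\right]^r > c',
\qquad \phi(t^*) = 0 \quad \text{if the sum is} < c'.
\]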
We shall show that the function in the last line is a monotone-increasing function of $t^*$, but first we complete the derivation of the most powerful test for our problem. Because of this monotonicity the test is equivalent to

\[
\phi(t^*) = 1 \ \text{ if } t^* > c, \qquad \phi(t^*) = 0 \ \text{ if } t^* < c.
\]

Using the probability distribution ($\delta = 0$), which is very simply related to the Student distribution, we find that $c = (n-1)^{-1/2} t_\alpha$, where $t_\alpha$ is exceeded with probability $\alpha$ according to the Student distribution with $n - 1$ degrees of freedom. Therefore

\[
\phi(t^*) = 1 \ \text{ if } t^* > (n-1)^{-1/2} t_\alpha, \qquad \phi(t^*) = 0 \ \text{ if } t^* < (n-1)^{-1/2} t_\alpha .
\]
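To show that our choice of distribution was least favorable, we must prove that $\phi(t^*)$ is of correct size for the composite hypothesis in (3.49).

\[
\begin{aligned}
E_\delta\{\phi(T^*)\} &= \Pr_\delta\{T^* > (n-1)^{-1/2} t_\alpha\} \\
&= \Pr_\delta\{Z/\chi > (n-1)^{-1/2} t_\alpha\} \\
&= \Pr_\delta\{Z - \delta > (n-1)^{-1/2} t_\alpha \chi - \delta\},
\end{aligned}
\]

where $Z - \delta$ has the normal distribution with mean 0 and variance 1 and $\chi$ is independent of $Z$. Then for $\delta \le 0$ we have

\[
\begin{aligned}
E_\delta\{\phi(T^*)\} &\le \Pr_\delta\{Z - \delta > (n-1)^{-1/2} t_\alpha \chi\} \\
&= \Pr_0\{Z > (n-1)^{-1/2} t_\alpha \chi\} \\
&= \alpha,
\end{aligned}
\]

with the last step obtained directly from the definition for $t_\alpha$. Thus the test $\phi(t^*)$ is a size-$\alpha$ test for the hypothesis H: $\delta \le 0$. Also, the test does not depend on $\delta_1$. It is therefore a uniformly most powerful test for the problem expressed in terms of $t^*$. From the theory at the beginning of this section, it is then the most powerful invariant test for the original problem (3.44). It is to be noted that the test is the same as the ordinary $t$ test:

\[
\begin{aligned}
\phi(x_1, \cdots, x_n) &= 1 \ \text{ if } t^* > (n-1)^{-1/2} t_\alpha, & &= 0 \ \text{ if } t^* < (n-1)^{-1/2} t_\alpha; \\
&= 1 \ \text{ if } \frac{n^{1/2}\bar{x}}{\left[\sum (x_i - \bar{x})^2\right]^{1/2}} > (n-1)^{-1/2} t_\alpha, & &= 0 \ \text{ if } \frac{n^{1/2}\bar{x}}{\left[\sum (x_i - \bar{x})^2\right]^{1/2}} < (n-1)^{-1/2} t_\alpha; \\
&= 1 \ \text{ if } \frac{n^{1/2}\bar{x}}{s_x} > t_\alpha, & &= 0 \ \text{ if } \frac{n^{1/2}\bar{x}}{s_x} < t_\alpha,
\end{aligned}
\]

where $s_x^2$ is the sample variance, $(n-1)^{-1}\sum (x_i - \bar{x})^2$.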
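
The equivalence between the rule stated in terms of $t^*$ and the ordinary one-sided $t$ test can be checked on data. The sketch below is illustrative only: the function names and the sample are made up, and the critical value $t_\alpha$ is assumed to be supplied by the caller from a Student table (1.833 is the tabled upper 5% point for 9 degrees of freedom).

```python
import math

def t_statistics(xs):
    """Return (t_star, t): the invariant statistic and the ordinary t statistic."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)          # sum of squared deviations
    t_star = math.sqrt(n) * mean / math.sqrt(ss)   # t* = n^(1/2) xbar / [sum (x_i - xbar)^2]^(1/2)
    s_x = math.sqrt(ss / (n - 1))                  # sample standard deviation
    t = math.sqrt(n) * mean / s_x                  # ordinary t statistic
    return t_star, t

def one_sided_t_test(xs, t_alpha):
    """Test function: 1 (reject) if t* > (n - 1)^(-1/2) t_alpha, else 0."""
    t_star, _ = t_statistics(xs)
    return 1 if t_star > t_alpha / math.sqrt(len(xs) - 1) else 0

sample = [0.8, 1.9, -0.3, 1.2, 2.4, 0.5, 1.1, -0.2, 1.6, 0.9]  # hypothetical data
t_star, t = t_statistics(sample)
decision = one_sided_t_test(sample, t_alpha=1.833)
print(t_star, t, decision)
```

Since $t = (n-1)^{1/2} t^*$, comparing $t^*$ with $(n-1)^{-1/2} t_\alpha$ and comparing $t$ with $t_\alpha$ always give the same decision.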


We now show that, for $\delta > 0$,

\[
\sum_{r=0}^{\infty} \frac{\delta^r}{r!}\, \Gamma\left(\frac{n+r}{2}\right) 2^{r/2}\, \frac{t^r}{(1 + t^2)^{r/2}}
\]

is a monotone-increasing function of $t$. Since $t/(1 + t^2)^{1/2}$ is monotone-increasing and since $2^{1/2}$ can be combined with the arbitrary positive $\delta$, it suffices to show that

\[
\sum_{r=0}^{\infty} \frac{\delta^r}{r!}\, \Gamma\left(\frac{n+r}{2}\right) t^r
\]

or that

\[
h_n(t) = \sum_{r=0}^{\infty} \frac{t^r}{r!}\, \Gamma\left(\frac{n+r}{2}\right)
\]

is monotone-increasing. Since $h_n(t)$ is, except for positive constants, the ratio of two nonzero probability density functions, it must be positive everywhere. Also it is easily seen that

\[
\frac{d}{dt} h_n(t) = h_{n+1}(t).
\]

The function $h_n(t)$ has a positive derivative. Hence it must be a monotone-increasing function.


3.8. Stringency. For a number of problems we looked for test functions having maximum power for each parameter value of the alternatives. When such test functions did not exist we restricted our attention to those satisfying some reasonable property such as unbiasedness, similarity, or invariance and then again looked for tests having maximum power for each parameter value of the alternative. However, some of the simplest problems do not have such tests. We need, therefore, for choosing a test, some criterion which provides a compromise to maximizing the power uniformly over the alternative. We now formulate such a criterion: stringency.

First we define a function called the envelope power which exhibits the maximum power attainable for each parameter value of the alternative.

The envelope power function for tests of size $\alpha$ is

\[
\beta_\alpha(\theta) = \sup_{\phi} P_\phi(\theta),
\]

where the supremum is taken over the test functions $\phi(x)$ of size $\alpha$:

\[
E_\theta\{\phi(X)\} \le \alpha \quad \text{for all } \theta \in \omega.
\]
A reasonable thing to examine for any test function is the amount by which its power falls short of the maximum possible at each parameter value of the alternative:

\[
\beta_\alpha(\theta) - P_\phi(\theta).
\]

Earlier in this section our attempts have been to effectively minimize this expression for each $\theta$ in the alternative. Here we take $\beta_\alpha(\theta) - P_\phi(\theta)$ as a measure of the "shortcoming" of the test $\phi$ for the alternative $\theta$; then

\[
\sup_{\theta \in \Omega - \omega} \left(\beta_\alpha(\theta) - P_\phi(\theta)\right)
\]

is the most extreme shortcoming of the test under the alternative.

Figure 14. The envelope power $\beta_\alpha(\theta)$. [Plot over $\theta$, with $\omega$ flanked by $\Omega - \omega$, showing $\beta_\alpha(\theta)$ and the shortcoming $\beta_\alpha(\theta) - P_\phi(\theta)$.]

We use

\[
\sup_{\theta \in \Omega - \omega} \left[\beta_\alpha(\theta) - P_\phi(\theta)\right]
\]

to compare different tests; a test that minimizes this is called a most stringent test.

A test function $\phi(x)$ is a most stringent size $\alpha$ test if it is of size $\alpha$ and if

\[
\sup_{\theta \in \Omega - \omega} \left[\beta_\alpha(\theta) - P_\phi(\theta)\right] \le \sup_{\theta \in \Omega - \omega} \left[\beta_\alpha(\theta) - P_{\phi^*}(\theta)\right]
\]

for any other size $\alpha$ test, $\phi^*(x)$.
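
To make the shortcoming concrete, the following sketch compares two size-0.05 tests of $\mu = 0$ against $\mu \ne 0$ for normal observations with variance 1. The setting, the function names, and the grid are assumptions chosen for illustration; they are not taken from the text. The envelope power at a single alternative is taken to be the power of the one-sided test pointing toward it, as the fundamental lemma gives for a simple alternative. Everything is expressed through $d = n^{1/2}\mu$.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Z_05 = 1.6448536    # upper 5% point of the standard normal
Z_025 = 1.9599640   # upper 2.5% point of the standard normal

def envelope(d):
    """Envelope power at d: best size-0.05 test against that single alternative."""
    return Phi(abs(d) - Z_05)

def power_two_sided(d):
    """Power of the test rejecting when |n^(1/2) xbar| > Z_025."""
    return Phi(d - Z_025) + Phi(-d - Z_025)

def power_right_sided(d):
    """Power of the test rejecting when n^(1/2) xbar > Z_05."""
    return Phi(d - Z_05)

def max_shortcoming(power, grid):
    """Largest gap between the envelope power and a test's power over the grid."""
    return max(envelope(d) - power(d) for d in grid)

grid = [k / 10 for k in range(-60, 61) if k != 0]   # alternatives only
short_two = max_shortcoming(power_two_sided, grid)
short_right = max_shortcoming(power_right_sided, grid)
print(short_two, short_right)
```

On this grid the two-sided test has a much smaller maximum shortcoming than the right-sided test, whose power collapses for negative $\mu$; this is the kind of comparison the stringency criterion formalizes.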


One possible procedure for obtaining most stringent tests is given in the following theorem.


THEOREM 3.9. (HUNT AND STEIN). If $\Omega - \omega$ is partitioned into disjoint subsets $\Omega_\delta$ such that the envelope power $\beta_\alpha(\theta)$ is constant on each $\Omega_\delta$ and if $\phi_\delta(x)$, the test that maximizes $\inf_{\theta \in \Omega_\delta} P_\phi(\theta)$, is independent of $\delta$ ($\phi_\delta(x) = \phi(x)$), then $\phi$ is most stringent.
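Proof. Since $\beta_\alpha(\theta)$ is constant over $\Omega_\delta$, maximizing $\inf_{\theta \in \Omega_\delta} P_\phi(\theta)$ is clearly equivalent to minimizing $\sup_{\theta \in \Omega_\delta} -P_\phi(\theta)$ or $\sup_{\theta \in \Omega_\delta} (\beta_\alpha(\theta) - P_\phi(\theta))$. Therefore the test function $\phi(x)$ minimizes $\sup_{\theta \in \Omega_\delta} (\beta_\alpha(\theta) - P_\phi(\theta))$ for each $\delta$. But the minimizing of $\sup_{\theta \in \Omega_\delta} (\beta_\alpha(\theta) - P_\phi(\theta))$ for each $\delta$ implies that $\sup_\delta \sup_{\theta \in \Omega_\delta} (\beta_\alpha(\theta) - P_\phi(\theta))$ is minimized. The theorem follows by noting that the operation $\sup_\delta \sup_{\theta \in \Omega_\delta}$ is equivalent to $\sup_{\theta \in \Omega - \omega}$.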
To apply this theorem, however, we need a method for obtaining tests which satisfy a size condition

\[
E_\theta\{\phi(X)\} \le \alpha
\]

and which maximize the minimum power

\[
\inf_{\theta \in \Omega_\delta} P_\phi(\theta)
\]

over a composite alternative $\Omega_\delta$. This can sometimes be accomplished by the method of least favorable distributions introduced in Section 3.3. We try to find a $P_\theta$ ($\theta$ in $\Omega_\delta$) or a weighted average of $P_\theta$'s ($\theta$ in $\Omega_\delta$) which most resembles the measures of the hypothesis. If we find the test that maximizes the power for this representative alternative and if the power elsewhere in the alternative $\Omega_\delta$ is at least as large, then the test maximizes the minimum power over $\Omega_\delta$. For, if $\phi(x)$ is the size $\alpha$ test that maximizes the power for the alternative

\[
\int_{\Omega_\delta} P_\theta(A) \, d\eta(\theta),
\]

then, using the assumption that the power is at least as large for the other $\theta$'s in the alternative, we have

\[
\begin{aligned}
\inf_{\theta \in \Omega_\delta} P_\phi(\theta) &= \int_{\Omega_\delta} P_\phi(\theta) \, d\eta(\theta) \\
&\ge \int_{\Omega_\delta} P_{\phi^*}(\theta) \, d\eta(\theta) \\
&\ge \inf_{\theta \in \Omega_\delta} P_{\phi^*}(\theta).
\end{aligned}
\]

This proves the following theorem.


THEOREM 3.10. If $\phi(x)$ is the size-$\alpha$ test that maximizes the power for the simple alternative $\int_{\Omega_\delta} P_\theta(A) \, d\eta(\theta)$ and if this maximized power is less than or equal to the test's power at each $\theta$ in $\Omega_\delta$, then $\phi(x)$ maximizes the minimum power over $\Omega_\delta$.


This theorem assumes that the measurability conditions are satisfied which allow the interchange of order of integration with respect to $x$ and $\theta$. As examples of the application of this theorem see Problems 39, 40.
Another possible procedure for obtaining most stringent tests is provided by the method of invariance. For this we need to define four types of transformation groups.

(i) $\{sx = x + c \mid c \in\, ]-\infty, +\infty[\}$, $x$ a real variable.

(ii) $\{sx = ax \mid a \in\, ]0, \infty[\}$, $x$ a real variable.
(iii) The group of orthogonal transformations on a Euclidean space.
(iv) Any finite group.


We give without proof the following lemma of Hunt and Stein.


THEOREM 3.11. (HUNT AND STEIN). If $\mathcal{G}$ can be factored by normal subgroups such that the normal subgroup at each stage and the final factor group are of types (i), $\cdots$, (iv) then, for any function $\phi(x)$ over $\mathcal{X}$ ($0 \le \phi(x) \le 1$), there exists a function $\psi(x)$ invariant under $\mathcal{G}$ ($0 \le \psi(x) \le 1$) such that

\[
\inf_{s \in \mathcal{G}} \int_{\mathcal{X}} \phi(sx)\, p(x) \, d\mu(x) \le \int_{\mathcal{X}} \psi(x)\, p(x) \, d\mu(x) \le \sup_{s \in \mathcal{G}} \int_{\mathcal{X}} \phi(sx)\, p(x) \, d\mu(x)
\]

for all integrable functions $p(x)$.


Let $\mathcal{G}$ be a group of transformations which leaves invariant the hypothesis testing problem

    Hypothesis: $\theta \in \omega$,
    Alternative: $\theta \in \Omega - \omega$;

and let $m(x)$ and $\bar{m}(\theta)$ be the maximal invariant functions over $\mathcal{X}$ and $\Omega$. Then we have the following theorem of Hunt and Stein.


THEOREM 3.12. (HUNT AND STEIN). If $\mathcal{G}$ satisfies the conditions in Theorem 3.11, if the measures $\{P_\theta \mid \theta \in \Omega\}$ are dominated by a measure $\mu(x)$, and if there is a most powerful invariant size-$\alpha$ test for the alternative $\bar{m}(\theta) = \bar{m}$, then this test, among size-$\alpha$ tests, maximizes the minimum power over those $\theta$'s having $\bar{m}(\theta) = \bar{m}$.


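Proof. First we note by Theorem 3.8 that $\bar{m}(\theta) = \bar{m}$ gives a simple alternative for the problem of finding invariant tests. For the invariant partition of $\Omega$ induced by $\bar{m}(\theta)$, let $\Omega_{\bar{m}}$ be the set for which $\bar{m}(\theta) = \bar{m}$. Also let $\theta$, $\theta'$ be typical elements of $\Omega_{\bar{m}}$. Since $\mathcal{G}$ is a group, there is a transformation in $\mathcal{G}$ such that $\theta' = s\theta$. Therefore we have

\[
\begin{aligned}
\inf_{s \in \mathcal{G}} \int_{\mathcal{X}} \phi(sx)\, p_\theta(x) \, d\mu(x) &= \inf_{s \in \mathcal{G}} E_\theta\{\phi(sX)\} \\
&= \inf_{s \in \mathcal{G}} E_{s\theta}\{\phi(X)\} \\
&= \inf_{\theta' \in \Omega_{\bar{m}}} E_{\theta'}\{\phi(X)\}.
\end{aligned}
\]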
Now applying Theorem 3.11, we have the existence of an invariant test function $\psi(x)$ such that

\[
\inf_{\theta' \in \Omega_{\bar{m}}} E_{\theta'}\{\phi(X)\} \le \int_{\mathcal{X}} \psi(x)\, p_\theta(x) \, d\mu(x) = E_\theta\{\psi(X)\}
\]

for $\theta \in \Omega_{\bar{m}}$. Of course $E_\theta\{\psi(X)\}$ is constant over $\Omega_{\bar{m}}$, as is easily seen by applying Theorem 3.8 to the invariant function $\psi(x)$.


Now by the result in the theorem above, we may restrict our attention to invariant test functions if our only concern is to maximize $\inf_{\theta \in \Omega_{\bar{m}}} P_\phi(\theta)$. Hence, if there is a uniformly most powerful invariant function, it maximizes the minimum power over each set of the partition induced by $\bar{m}(\theta)$.


THEOREM 3.13. If $\mathcal{G}$ satisfies the condition in Theorem 3.11 and if the measures $\{P_\theta \mid \theta \in \Omega\}$ are dominated by a measure $\mu(x)$, then a uniformly most powerful invariant test is most stringent.


Proof. Follows directly from Theorems 3.9 and 3.12.


If we introduce a loss function, then the next theorem used with either Theorem 3.10 or Theorem 3.12 provides a possible procedure for obtaining minimax test functions.


THEOREM 3.14. (LEHMANN). If a test function maximizes the minimum power over each set of a partition $\Omega_\delta$, then it has minimax risk with respect to any loss function which is constant on each set $\Omega_\delta$.

Note. Because of the size restriction on the tests under consideration, we ignore that part of the loss function concerned with an incorrect decision under the hypothesis; in effect, we let $W_0(\theta) = 0$.

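Proof. Let $W_1(\theta)$ be the loss function for an incorrect decision under the alternative, and by assumption it is constant-valued over each set $\Omega_\delta$. Then, if $\phi(x)$ maximizes for each $\delta$

\[
\inf_{\theta \in \Omega_\delta} P_\phi(\theta),
\]

it minimizes

\[
\sup_{\theta \in \Omega_\delta} (1 - P_\phi(\theta))
\]

or equivalently minimizes

\[
\sup_{\theta \in \Omega_\delta} W_1(\theta)(1 - P_\phi(\theta)) = \sup_{\theta \in \Omega_\delta} R_\phi(\theta).
\]

But, since $\sup_\delta \sup_{\theta \in \Omega_\delta}$ is an equivalent operation to $\sup_{\theta \in \Omega - \omega}$, we have that $\phi(x)$ minimizes the maximum risk; that is, it is minimax.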
Problem 36 is to use these theorems to show that the ordinary $F$ test of the general linear hypothesis is most stringent, and minimax with respect to any invariant loss function.
3.9. Consistency and Efficiency. In many statistical problems the outcome $x$ is a point in a product space $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ where the $\mathcal{X}_i$ are identical spaces, say $\mathcal{X}$. Also, each probability measure is the power product of a measure on $\mathcal{X}$. It is then of interest to inquire whether a test defined for each "sample size" $n$ has good properties when $n$ is large. One such property is that of consistency.

Let $P_\theta$ be a probability measure over $\mathcal{X}$, and let $P_\theta^n$ be the power-product measure over $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$; then

A sequence of size $\alpha$ test functions $\{\phi_n(x)\}$ for the hypothesis $\theta \in \omega$ is consistent for the alternative $\theta \in \Omega - \omega$ if

\[
\lim_{n \to \infty} P_{\phi_n}(\theta) = 1
\]

for each $\theta \in \Omega - \omega$.

This means that, for any $\theta \in \Omega - \omega$, the power of $\phi_n$ can be made arbitrarily close to 1 by taking $n$ sufficiently large.

Often it is desirable to compare sequences of test functions for certain values of the parameter in the alternative. Since most test sequences under consideration for any problem will be consistent over most of the alternative, it is usually necessary to take values of $\theta$ that change with the sample size and become "close" to the hypothesis as the sample size becomes large. Let $\{\phi_n\}$, $\{\phi_n^*\}$ be two sequences of size-$\alpha$ tests. Also let $\{n_i\}$, $\{n_i^*\}$ be two increasing sequences of positive integers such that

\[
\lim_{i \to \infty} P_{\phi_{n_i}}(\theta_i) = \lim_{i \to \infty} P_{\phi^*_{n_i^*}}(\theta_i)
\]

with the two limits existing and not equal to 0 or 1. Then

The relative efficiency of $\{\phi_n\}$ re $\{\phi_n^*\}$ is

\[
e = \lim_{i \to \infty} \frac{n_i^*}{n_i}
\]

if the limit exists and is the same for all sequences $\{n_i\}$, $\{n_i^*\}$.

Thus efficiency is the limiting ratio of sample sizes such that the tests have the same limiting power for the sequence $\{\theta_i\}$ in the alternative.
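
A small numerical illustration of consistency, and of why alternatives that shrink with the sample size are needed for comparisons, is sketched below. It assumes a one-sided size-0.05 test of $\mu \le 0$ based on the mean of $n$ normal observations with variance 1; this setting and the names used are illustrative assumptions, not taken from the text.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Z_ALPHA = 1.6448536  # upper 5% point of the standard normal

def power(n, mu):
    """Power of the test rejecting when n^(1/2) xbar > Z_ALPHA."""
    return Phi(math.sqrt(n) * mu - Z_ALPHA)

sizes = [10, 50, 100, 400]

# Fixed alternative mu = 0.5: the power climbs toward 1 (consistency).
fixed = [power(n, 0.5) for n in sizes]

# Shrinking alternative mu_n = 2 / n^(1/2): the power stays put, so the
# limit is neither 0 nor 1 and can be used to compare test sequences.
local = [power(n, 2.0 / math.sqrt(n)) for n in sizes]

print(fixed)
print(local)
```

Against the fixed alternative every reasonable sequence of tests has power tending to 1, so nothing distinguishes them in the limit; against the shrinking alternatives the limiting power is a nondegenerate number, which is the situation the definition of relative efficiency requires.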


4. CONFIDENCE REGIONS 


In Section 2 we consider the estimation of real parameters. The 
purpose there was to find a procedure for calculating from the outcome an 
estimate which on the average would be close to the parameter value. 
Here, our purpose is quite similar. We want to find an interval or region 
which on the average will tend to contain the correct parameter value and 
not contain other values of it. Surprisingly though, the technique and
analysis for this fit closely the work of the preceding section on hypothesis
testing. In this section we shall define confidence regions and establish 
the analogy with hypothesis testing. An example to illustrate the con- 
nection with the material in Section 3 and a series of problems in Section 6 
will complete the general discussion. 

We first illustrate the idea of confidence regions by means of an example. Let $X_1, \cdots, X_n$ be independent and each be normally distributed with mean $\mu$ and variance 1 ($\mu \in\, ]-\infty, +\infty[$). Suppose we are interested in the parameter $\mu$, but, instead of wanting an estimate $d(x_1, \cdots, x_n)$ which comes close to $\mu$, we want to calculate a set on the real line which we hope will contain the value of the parameter $\mu$. Consider the interval $[\bar{x} - 1.96 n^{-1/2},\ \bar{x} + 1.96 n^{-1/2}]$. It is calculated from the outcome $(x_1, \cdots, x_n)$, and, as we shall see later, it tends to do better than other sets as far as containing the actual value of $\mu$, and not containing other values of $\mu$.

With a confidence interval we associate a confidence level, a number $\beta$ (often 0.90, 0.95, or 0.99). This is the probability with which the confidence region contains the actual value of the parameter. For our example the confidence level is 0.95; that is, with probability 0.95 the interval $[\bar{x} \pm 1.96 n^{-1/2}]$ will contain the actual value of $\mu$. We check this probability statement:


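\[
\begin{aligned}
\Pr_\mu\{\mu \in [\bar{X} \pm 1.96 n^{-1/2}]\}
&= \Pr_\mu\{\bar{X} - 1.96 n^{-1/2} \le \mu \le \bar{X} + 1.96 n^{-1/2}\} \\
&= \Pr_\mu\{\bar{X} - \mu - 1.96 n^{-1/2} \le 0 \le \bar{X} - \mu + 1.96 n^{-1/2}\} \\
&= \Pr_\mu\{+1.96 n^{-1/2} \ge \bar{X} - \mu \ge -1.96 n^{-1/2}\} \\
&= \Pr_\mu\left\{-1.96 \le \frac{\bar{X} - \mu}{n^{-1/2}} \le +1.96\right\} \\
&= \Pr\{-1.96 \le Z \le +1.96\} \\
&= 0.95,
\end{aligned}
\]

where $Z$ designates a random variable with the normal distribution having mean 0 and variance 1. At the end of this section we shall prove that in a certain sense this interval is the best confidence region with confidence level 0.95.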
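
The probability statement can also be checked by simulation. The sketch below is only an illustration: the function name, the seed, the sample size, the value of $\mu$, and the number of repetitions are arbitrary choices.

```python
import math
import random

def coverage(mu, n, reps, seed=0):
    """Fraction of simulated samples whose interval xbar +/- 1.96 n^(-1/2) contains mu."""
    rng = random.Random(seed)
    half_width = 1.96 / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

estimate = coverage(mu=2.0, n=10, reps=20000)
print(estimate)
```

The estimated coverage should be close to 0.95 whatever value of $\mu$ is used, since the distribution of $\bar{X} - \mu$ does not depend on $\mu$.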

We first give a general definition of confidence regions, and then we shall consider some optimum properties to look for. Let $\{P_\theta \mid \theta \in \Omega\}$ be a class of probability measures over $\mathcal{X}(\mathcal{A})$, and let $\eta(\theta)$ be a parameter taking its values in a parameter space $H$. $\eta(\theta)$ need not be real or vector-valued as we required in the theory of estimation.

The function $S(x)$ from $\mathcal{X}$ into the space of subsets of $H$ is a $\beta$ confidence region for $\eta(\theta)$ if

\[
\Pr_\theta\{\eta(\theta) \in S(X)\} \ge \beta,
\]

or equivalently if

\[
P_\theta(\{x \mid \eta(\theta) \in S(x)\}) \ge \beta.
\]


It is to be noted that the probability statement concerns a condition on the random variable $X$. This definition describes a function $S(x)$, which chooses a set of possible values for the parameter $\eta(\theta)$ such that the probability is at least $\beta$ that these values will contain the actual value of the parameter in the experiment.

Figure 15. The structure of a confidence region $S(x)$.

In line with the use of randomized decision functions we now define a randomized confidence region.

The function $S(x, r)$ from $\mathcal{X} \times [0,1]$ into the space of subsets of $H$ is a $\beta$ confidence region for $\eta(\theta)$ if

\[
\Pr_\theta\{\eta(\theta) \in S(X, R)\} \ge \beta
\]

where $R$ has the uniform distribution [0, 1].

There is nothing essential in the use of a uniform distribution or in the use of a real-valued random variable $R$. It is just a convenient method of introducing randomness and, with the availability of random digits, is easy to apply.

In Fig. 15 we exhibit the form of a confidence region $S(x)$ in relation to the sample space $\mathcal{X}$ and the parameter space $H$. For a given $x$, $S(x)$ is a subset of $H$. However, for the construction of a confidence region we are interested in the section of the combined confidence regions corresponding to a given value of the parameter, say $\eta'$. The resulting set $A(\eta') = \{x \mid \eta' \in S(x)\}$ is a subset of $\mathcal{X}$, and we assume it measurable for each $\eta'$. The probability statement says that for each $P_\theta$ having $\eta(\theta) = \eta'$ the probability measure of $A(\eta')$ must be $\beta$ or more. Hence, confidence regions can always be constructed in the following manner. For each $\eta'$ a subset $A(\eta')$ of $\mathcal{X}$ is determined such that

\[
P_\theta(A(\eta')) \ge \beta
\]

for each $\theta$ having $\eta(\theta) = \eta'$. A $\beta$ confidence region is the $x$-section of these sets arrayed as in Fig. 15:

\[
S(x) = \{\eta \mid x \in A(\eta)\}.
\]

Obviously $S(x)$ satisfies the requirements in the definition.

For a confidence region it is convenient to define a characteristic function $\phi_\eta(x)$. In the nonrandomized case

\[
\phi_\eta(x) = 1 \ \text{ if } \eta \in S(x), \qquad \phi_\eta(x) = 0 \ \text{ if } \eta \notin S(x),
\]

and for the randomized case

\[
\phi_\eta(x) = \Pr_R\{\eta \in S(x, R)\},
\]

where the probability is taken with respect to the random variable $R$. In the nonrandomized case, if we think of $\phi_\eta(x)$ as a function of two variables, then it is the characteristic function for the combined confidence regions $S$ as exhibited in Fig. 15, and, if we think of it as a function of $x$ for a given $\eta$, it is the characteristic function for the horizontal section $A(\eta)$, which of course must have probability at least $\beta$ for the corresponding $\theta$'s. For a given $x$ in the randomized case, $\phi_\eta(x)$ gives a "probability evaluation" of the different $\eta$'s in the parameter space $H$.

In terms of the characteristic function we can restate our definition of a confidence region.

The function $S(x)$ from $\mathcal{X}$ into the space of subsets of $H$ is a $\beta$ confidence region for $\eta$ if the corresponding characteristic function satisfies

\[
E_\theta\{\phi_{\eta(\theta)}(X)\} \ge \beta
\]

for all $\theta \in \Omega$.

THEOREM 4.1. For any function $\phi_\eta(x)$ with values in [0,1] and satisfying

\[
E_\theta\{\phi_{\eta(\theta)}(X)\} \ge \beta,
\]

there corresponds at least one $\beta$ confidence region.
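Proof. We define a function $S(x, r)$ by the equation

\[
S(x, r) = \{\eta \mid \phi_\eta(x) \ge r\}.
\]

This is a $\beta$ confidence region since

\[
\begin{aligned}
\Pr_\theta\{\eta(\theta) \in S(X, R)\}
&= E_\theta\{\Pr_R[\eta(\theta) \in S(X, R)]\} \\
&= E_\theta\{\Pr_R[\phi_{\eta(\theta)}(X) \ge R]\} \\
&= E_\theta\{\phi_{\eta(\theta)}(X)\} \\
&\ge \beta.
\end{aligned}
\]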
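
The construction used in the proof can be sketched in code. Everything named here is illustrative: the parameter space is replaced by a finite grid, and the characteristic function is the nonrandomized one belonging to the interval $\bar{x} \pm 1.96 n^{-1/2}$ for a normal mean with variance 1. A characteristic function taking fractional values would be handled by the same `region` function, with `r` drawn uniformly from [0, 1].

```python
import math

def phi(eta, xs):
    """Characteristic function of the interval xbar +/- 1.96 n^(-1/2), evaluated at eta."""
    n = len(xs)
    xbar = sum(xs) / n
    return 1.0 if abs(xbar - eta) <= 1.96 / math.sqrt(n) else 0.0

def region(char_fn, xs, r, eta_grid):
    """S(x, r) = {eta : phi_eta(x) >= r}, restricted to a finite grid of eta values."""
    return [eta for eta in eta_grid if char_fn(eta, xs) >= r]

xs = [1.0, 2.0, 3.0, 2.0]                      # hypothetical outcome, xbar = 2.0
eta_grid = [k / 10 for k in range(0, 41)]      # grid 0.0, 0.1, ..., 4.0
s = region(phi, xs, r=0.5, eta_grid=eta_grid)
print(min(s), max(s), len(s))
```

For a nonrandomized characteristic function the value of `r` does not matter as long as it is positive, and the region reduces to the set on which the characteristic function equals 1.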

The theorem and the definition preceding it show that, for any construction or theory of confidence regions, we can equivalently work with characteristic functions. It will be by means of characteristic functions that we establish the connection with hypothesis testing. However, to do this it is convenient to introduce a function very simply related to the characteristic function; we call it the auxiliary function $\psi_\eta(x)$:

\[
\psi_\eta(x) = 1 - \phi_\eta(x).
\]

It is the characteristic function for the complement of the confidence region in the nonrandomized case. In terms of $\psi_\eta(x)$ the condition for a $\beta$ confidence region becomes

\[
E_\theta\{\psi_{\eta(\theta)}(X)\} \le 1 - \beta.
\]

It is this condition that we associate with the size in hypothesis testing. When $\beta$ is one of the usual values 0.90, 0.95, 0.99, then $1 - \beta$ is one of the values 0.10, 0.05, 0.01 which are frequently used for the size in hypothesis testing.

For a clear picture of the relationship between a confidence region and the corresponding hypothesis testing problems, it is helpful to give a verbal description. Suppose $\psi_\eta(x)$ is a size $1 - \beta$ test for the problem

    Hypothesis: $\eta(\theta) = \eta$,
    Alternative: $\eta(\theta) \ne \eta$.

Then for a given outcome we "test" each $\eta$ and decide whether to reject it or accept it. The $\eta$'s we reject become points outside the confidence region; those we accept become points inside the confidence region. This constructs a confidence region for that $x$, and by the theory above it will have confidence level $\beta$.

We define the power of a confidence region:

The power of the confidence region $S(x, r)$ at the parameter value $\eta$ is

\[
P(\theta; \eta) = E_\theta\{\psi_\eta(X)\}.
\]

The power $P(\theta; \eta)$ of a confidence region is the probability that it does not cover $\eta$ when the probability measure is given by $\theta$. Of course, when $\eta = \eta(\theta)$, we want this probability to be less than $1 - \beta$ in order to satisfy the size condition. However, when $\eta \ne \eta(\theta)$ the power is the probability of not covering a value $\eta$ which is not the value of the parameter $\eta(\theta)$ in the experiment. This probability we want to be large.

In terms of the auxiliary function we have the condition for a confidence region:

\[
P(\theta; \eta) = E_\theta\{\psi_\eta(X)\} \le 1 - \beta
\]

for all $\theta$, $\eta$ having $\eta = \eta(\theta)$; and, in order to obtain a good confidence region, we want to maximize

\[
P(\theta; \eta) = E_\theta\{\psi_\eta(X)\}
\]

for all $\theta$, $\eta$ having $\eta \ne \eta(\theta)$.

To establish the connection with hypothesis testing we partition the space $\Omega$ by the function $\eta(\theta)$ and obtain sets $\Omega_\eta$:

\[
\Omega_\eta = \{\theta \mid \eta(\theta) = \eta\}.
\]

Now by examining the conditions above it is seen that to obtain the auxiliary function of a good $\beta$ confidence region is to obtain a good test function $\psi_\eta(x)$ of size $1 - \beta$ for the hypothesis testing problem,

    Hypothesis: $\theta \in \Omega_\eta$,
    Alternative: $\theta \notin \Omega_\eta$.

Then by applying the straightforward connection with hypothesis testing we can have most powerful confidence regions, most powerful similar, unbiased or invariant confidence regions, or most stringent confidence regions.

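EXAMPLE 4.1. Let $X_1, \cdots, X_n$ be independent and each be normally distributed with mean $\mu$ and variance $\sigma^2$, where $\mu \in R^1$ and $\sigma^2 \in\, ]0, \infty[$. Consider the problem of finding a $\beta$ confidence region for the parameter $\mu$.

According to the theory above we partition $\Omega$ into sets

\[
\Omega_{\mu'} = \{(\mu', \sigma^2) \mid \sigma^2 \in\, ]0, \infty[\},
\]

and look for a good test function of size $1 - \beta$ for the problem:

    Hypothesis: $\mu = \mu'$, $\sigma^2 \in\, ]0, \infty[$,
    Alternative: $\mu \ne \mu'$, $\sigma^2 \in\, ]0, \infty[$.

By Problem 35 the ordinary $t$ test is most stringent; it is

\[
\psi_{\mu'}(x) = 1 \ \text{ if } \frac{n^{1/2}|\bar{x} - \mu'|}{s_x} > t_{(1-\beta)/2}, \qquad
\psi_{\mu'}(x) = 0 \ \text{ if } \frac{n^{1/2}|\bar{x} - \mu'|}{s_x} < t_{(1-\beta)/2},
\]

where $t_\alpha$ is the point exceeded with probability $\alpha$ according to Student's distribution with $n - 1$ degrees of freedom. Now, transferring to the characteristic function for the confidence region, we obtain

\[
\phi_{\mu'}(x) = 1 \ \text{ if } \frac{n^{1/2}|\bar{x} - \mu'|}{s_x} < t_{(1-\beta)/2}, \qquad
\phi_{\mu'}(x) = 0 \ \text{ if } \frac{n^{1/2}|\bar{x} - \mu'|}{s_x} > t_{(1-\beta)/2}.
\]

The confidence region $S(x)$ is then given by

\[
\begin{aligned}
S(x) &= \{\mu \mid \phi_\mu(x) = 1\} \\
&= \{\mu \mid |\bar{x} - \mu| < n^{-1/2} s_x\, t_{(1-\beta)/2}\} \\
&= [\bar{x} \pm n^{-1/2} s_x\, t_{(1-\beta)/2}].
\end{aligned}
\]

This is a most stringent $\beta$ confidence region.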
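
The interval of Example 4.1 is easy to compute once the Student critical value is available. In the sketch below the function name and the data are made up, and the critical value $t_{(1-\beta)/2}$ is assumed to be looked up by the caller (2.776 is the tabled upper 2.5% point for 4 degrees of freedom, giving $\beta = 0.95$).

```python
import math

def t_confidence_interval(xs, t_crit):
    """Return (lower, upper) = xbar -/+ n^(-1/2) * s_x * t_crit."""
    n = len(xs)
    mean = sum(xs) / n
    s_x = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample standard deviation
    half_width = t_crit * s_x / math.sqrt(n)
    return mean - half_width, mean + half_width

data = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical sample, n = 5
lower, upper = t_confidence_interval(data, t_crit=2.776)
print(lower, upper)
```

A value $\mu'$ lies inside the returned interval exactly when the two-sided $t$ test of $\mu = \mu'$ at size $1 - \beta$ accepts, which is the correspondence the example relies on.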




5. TOLERANCE LIMITS 


When an article is mass-produced, a certain amount of variability is 
inevitable, but, if an individual item from production deviates excessively 
from the desired form, it may be unacceptable for the use to which the 
articles are being put. The limits that divide an acceptable article from a 
defective article are called tolerance limits—limits that define the tolerable 
variability. In many applications, production is considered satisfactory if 
95% (or 99%) of the articles fall within the tolerance limits. 

Statistical tolerance limits are somewhat different. After production 
has commenced, a sample from production may be used to calculate limits 
which the statistician feels reasonably sure contain a fraction, say 95%, of 
the articles being produced. Such limits are called statistical tolerance 
limits, or merely tolerance limits when there is no chance of confusion. 
In effect, the statistician on the basis of a sample is trying to find a region 
in the range of the article’s variability, containing 95% of production, a 
statistical tolerance region. A statistical tolerance region may be cal- 
culated for various reasons: to find out what sort of an article is being 
produced, to compare with the tolerance limits to see if most of the 
articles being produced are acceptable, or to keep a check on the production 
equipment to see that it produces the acceptable type of article. We shall 
consider only the statistical tolerance regions. 
As an example consider $X = (X_1, \cdots, X_4)$ where the $X_i$ are independently distributed and each has the same probability measure on the real line. Let $F_\theta(x)$ be the distribution function for an $X_i$ and assume that it is a continuous function. If we take the smallest, $x_{(1)}$, and the largest, $x_{(4)}$, of the measurements in the outcome $(x_1, \cdots, x_4)$ and form the interval $[x_{(1)}, x_{(4)}]$, then we have a region in the space of the probability measure being sampled, and this region is a function of the outcome $(x_1, \cdots, x_4)$.

We are interested in the proportion of future sampled articles in this region; that is, in the probability in this region:

\[
P_\theta([x_{(1)}, x_{(4)}]) = F_\theta(x_{(4)}) - F_\theta(x_{(1)}).
\]

This probability is a function of the outcome $(x_1, \cdots, x_4)$ and hence has an induced probability distribution corresponding to the distribution $F_\theta(x)$ for each $x_i$. We now calculate the probability that the interval $[X_{(1)}, X_{(4)}]$ contains 95% of the probability for the distribution $F_\theta(x)$.

\[
\Pr_\theta\{P_\theta([X_{(1)}, X_{(4)}]) \ge 0.95\} = \Pr_\theta\{F_\theta(X_{(4)}) - F_\theta(X_{(1)}) \ge 0.95\}.
\]


This probability is simple to evaluate because $F_\theta(X_{(4)}) - F_\theta(X_{(1)})$ has a very simple probability distribution when the same value of $\theta$ gives the distribution for the $X_i$. As we shall see in Chapter 4, this is a $\beta$ distribution with parameters 3, 2.

\[
\begin{aligned}
\Pr_\theta\{P_\theta([X_{(1)}, X_{(4)}]) \ge 0.95\}
&= \frac{\Gamma(5)}{\Gamma(3)\Gamma(2)} \int_{0.95}^{1} x^2 (1 - x) \, dx \\
&= 12 \int_0^{0.05} x (1 - x)^2 \, dx \\
&= 12 \int_0^{0.05} (x - 2x^2 + x^3) \, dx \\
&= 12 \left[\frac{x^2}{2} - \frac{2x^3}{3} + \frac{x^4}{4}\right]_0^{0.05} \\
&= 12 \left[\tfrac{1}{2}(0.0025) - \tfrac{2}{3}(0.000125) + \tfrac{1}{4}(0.00000625)\right] \\
&= 0.014.
\end{aligned}
\]

Thus there is only 0.014 probability that the interval $[X_{(1)}, X_{(4)}]$ will contain 95% of the probability of the distribution. Certainly we need more than a sample of four to pin down 95% of the probability.
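
The arithmetic above, and the question of how large a sample would be needed, can be explored with a short calculation. The sketch assumes the standard extension of the fact quoted from Chapter 4: for a sample of $n$ from a continuous distribution, the probability content of $[X_{(1)}, X_{(n)}]$ has the $\beta$ distribution with parameters $n - 1$, 2, whose distribution function at $p$ is $n p^{n-1} - (n-1) p^n$. The function names are illustrative.

```python
def prob_range_covers(n, p):
    """Pr{content of [X_(1), X_(n)] >= p}, with the content assumed Beta(n - 1, 2)."""
    return 1.0 - (n * p ** (n - 1) - (n - 1) * p ** n)

def smallest_sample_size(p, beta):
    """Smallest n for which the range covers a proportion p with probability >= beta."""
    n = 2
    while prob_range_covers(n, p) < beta:
        n += 1
    return n

four = prob_range_covers(4, 0.95)          # the case worked in the text
needed = smallest_sample_size(0.95, 0.95)  # sample size for a 0.95 assurance
print(four, needed)
```

For $n = 4$ the function reproduces the integral evaluated above, and the search shows how quickly the required sample size grows when both the proportion and the assurance are set at 0.95.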

We now generalize these ideas on tolerance regions and consider a number of different definitions. Let $\{P_\theta \mid \theta \in \Omega\}$ be a class of probability measures over the measurable space $\mathcal{X}(\mathcal{A})$. We shall consider tolerance regions based on a sample of $n$ from one of these probability measures. Therefore, the sample space for the problem will be $\mathcal{X}^n$, and each measure in the class for $\mathcal{X}^n$ will be the power product of a measure in $\{P_\theta \mid \theta \in \Omega\}$. Now by the above example we see that, for each value of the outcome $(x_1, \cdots, x_n)$, we wish to associate a subset of the space $\mathcal{X}$. Accordingly, our first requirement is that a tolerance region $S(x_1, \cdots, x_n)$ be a mapping from $\mathcal{X}^n$ into $\mathcal{A}$. The thing of interest about the region $S(x_1, \cdots, x_n)$ for a given outcome is the probability in the region as determined by the probability measure $P_\theta$ which gave rise to that outcome. The probability measure of $S$ using $P_\theta$ is

\[
P_\theta(S(x_1, \cdots, x_n));
\]

it is a real number between 0 and 1. This function of the outcome has an induced probability distribution corresponding to the product measure of $P_\theta$ over $\mathcal{X}^n$. It is this distribution that tells us how the probability content of $S(x_1, \cdots, x_n)$ varies in repeated sampling from a given probability measure. If we are interested in how often the region $S(x_1, \cdots, x_n)$ contains at least a proportion $p$ of the probability, then we calculate

\[
\Pr_\theta\{P_\theta(S(X_1, \cdots, X_n)) \ge p\}.
\]

DEFINITION 5.1. $S(x_1, \cdots, x_n)$ is a $\beta$ tolerance region for a proportion $p$ if

\[
\inf_{\theta \in \Omega} \Pr_\theta\{P_\theta(S(X_1, \cdots, X_n)) \ge p\} = \beta.
\]

The probability of containing a proportion $p$ of the probability may change with $\theta$. The lower bound of this probability is the confidence level with which we are able to assert that $S(x_1, \cdots, x_n)$ contains at least a proportion $p$ of the probability. Using the frequency interpretation of probability, we can rephrase the definition as follows: $S(x_1, \cdots, x_n)$ is a $\beta$ tolerance region for a proportion $p$ if at least $\beta$ of the time in repeated sampling the region $S(X_1, \cdots, X_n)$ contains at least $p$ of the probability as determined by the measure producing the sample elements $x_i$.

Another definition which has much stronger requirements is sometimes 
used in the hope of obtaining tolerance regions with more regular 
behavior. 

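DEFINITION 5.2. $S(x_1, \cdots, x_n)$ is a distribution-free tolerance region for $\{P_\theta \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$ if the induced probability distribution of

\[
P_\theta(S(x_1, \cdots, x_n)),
\]

corresponding to $P_\theta$ for each $x_i$, is independent of $\theta \in \Omega$.

The induced distribution function, say $G_\theta(p)$ for $P_\theta(S(X_1, \cdots, X_n))$, is given by

\[
G_\theta(p) = \Pr_\theta\{P_\theta(S(X_1, \cdots, X_n)) \le p\}.
\]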
The definition then states that $G_\theta(p) = G(p)$. The example of a tolerance region at the beginning of this section was distribution-free. In fact, the distribution of the probability content of the region was a fixed $\beta$ distribution. We shall see in Chapter 4 that there is a large class of tolerance regions for which the $\beta$ distribution is the induced distribution of the probability content of the region.

It is possible to give analytic conditions under which a tolerance region satisfies Definition 5.2. For this we need to define the characteristic function of a tolerance region, $\phi_y(x_1, \cdots, x_n)$.

\[
\phi_y(x_1, \cdots, x_n) = 1 \ \text{ if } y \in S(x_1, \cdots, x_n), \qquad
\phi_y(x_1, \cdots, x_n) = 0 \ \text{ if } y \notin S(x_1, \cdots, x_n).
\]

Then it is easily seen that $P_\theta(S(x_1, \cdots, x_n)) = E_\theta^Y\{\phi_Y(x_1, \cdots, x_n)\}$, where the expectation applies to the random variable $Y$ with probability measure $P_\theta$.

THEOREM 5.1. A necessary and sufficient condition that $S(x_1, \cdots, x_n)$ be a distribution-free tolerance region is that there exist a sequence $\alpha_1, \alpha_2, \cdots$ such that

\[
\phi_{y_1}(x_1, \cdots, x_n) - \alpha_1, \ \cdots, \ \prod_{j=1}^{r} \phi_{y_j}(x_1, \cdots, x_n) - \alpha_r, \ \cdots
\]

are, respectively, unbiased estimators of zero over $\mathcal{X}^{n+1}, \cdots, \mathcal{X}^{n+r}, \cdots$ for the power-product measures of $\{P_\theta \mid \theta \in \Omega\}$. The sequence $\alpha_1, \alpha_2, \cdots$ is the moment sequence for the distribution of $P_\theta(S(X_1, \cdots, X_n))$.

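Proof. A distribution-free tolerance region has the distribution function $G_\theta(p)$ independent of $\theta$. Now, since a distribution function on a bounded interval is uniquely determined by the corresponding moment sequence (see [4]), it is equivalent to state that the moment sequence for $G_\theta(p)$ should be independent of $\theta$. The $r$th moment for $G_\theta(p)$ is given by $\alpha_r$.

\[
\begin{aligned}
\alpha_r &= \int_0^1 p^r \, dG_\theta(p) \\
&= \int_{\mathcal{X}^n} \left[P_\theta(S(x_1, \cdots, x_n))\right]^r \prod_{i=1}^{n} dP_\theta(x_i) \\
&= \int_{\mathcal{X}^n} \left(E_\theta^Y\{\phi_Y(x_1, \cdots, x_n)\}\right)^r \prod_{i=1}^{n} dP_\theta(x_i) \\
&= \int_{\mathcal{X}^n} \left[\int_{\mathcal{X}} \phi_y(x_1, \cdots, x_n) \, dP_\theta(y)\right]^r \prod_{i=1}^{n} dP_\theta(x_i) \\
&= \int_{\mathcal{X}^{n+r}} \prod_{j=1}^{r} \phi_{y_j}(x_1, \cdots, x_n) \prod_{j=1}^{r} dP_\theta(y_j) \prod_{i=1}^{n} dP_\theta(x_i).
\end{aligned}
\]

Therefore $\prod_{j=1}^{r} \phi_{y_j}(x_1, \cdots, x_n) - \alpha_r$ is an unbiased estimate of zero over $\mathcal{X}^{n+r}$. Thus the statement that $G_\theta(p)$ is independent of $\theta$ is equivalent to the above expression being an unbiased estimate of zero for all $r$.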
Another definition for tolerance regions is concerned with the average or 
expected probability in a tolerance region. For it, we can set up a cor- 
respondence with hypothesis testing which will permit us to find tolerance 
regions with optimum properties. 
DEFINITION 5.3. S(a1,°**,%,) is a f-expectation tolerance region if 


E PASY =- X,))} = B 
for all 0 EQ. 
In terms of the characteristic function the above condition becomes 


Jaben e) TI dP) dP) = B. 


In order to introduce the notion of a good tolerance region we need a
function that will tell us the relative merits of sets $S$ in $\mathcal{X}$ when the
probability measure is $P_\theta$. Let the “desirability” of a set $S$ when the
probability measure is $P_\theta$ be given by a probability measure $Q_\theta(S)$ defined
for all $S \in \mathcal{A}$. We assume that

$$Q_\theta(S) = \int_S f_\theta(x) \, dP_\theta(x).$$

Then the measure of merit or power of a tolerance region will be given by

$$E_\theta\{Q_\theta(S(X_1, \ldots, X_n))\}$$

or in terms of the characteristic function by

$$\int \phi_y(x_1, \ldots, x_n) \prod_{i=1}^{n} dP_\theta(x_i) \, f_\theta(y) \, dP_\theta(y).$$

Thus to find the characteristic function of a good tolerance region is to
find a good similar test function $\phi_y(x_1, \ldots, x_n)$ for the hypothesis testing
problem

Hypothesis: $Y, X_1, \ldots, X_n$ independent, each with measure $P_\theta$
($\theta \in \Omega$),

Alternative: $X_1, \ldots, X_n$ independent, each with measure $P_\theta$;
$Y$ independent of the $X_i$ and with measure $Q_\theta$ ($\theta \in \Omega$).


If the test function should turn out to be randomized, then we need a 
definition of a randomized tolerance region; this can be given in the same 
way that a randomized confidence region was introduced in the last 
section. For further reading on the construction of best $\beta$-expectation
tolerance regions, see Fraser and Guttman [13].
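As a numerical illustration of Definition 5.3, the following sketch estimates the expected probability content of one simple region, the sample range $S = [\min X_i, \max X_i]$. The choice of region, the exponential sampling distribution, and all function names are assumptions made for this example only; they are not taken from the text. For a sample of $n$ from a continuous distribution the range is known to have expected content $(n-1)/(n+1)$, so it behaves as a $\beta$-expectation tolerance region with $\beta = (n-1)/(n+1)$.

```python
import math
import random


def expected_coverage(n, trials=20000, seed=0):
    """Monte Carlo estimate of E{P(S(X_1..X_n))} for S = [min X_i, max X_i]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumed sampling law for the illustration: exponential with mean 1.
        xs = [rng.expovariate(1.0) for _ in range(n)]
        lo, hi = min(xs), max(xs)
        # Probability content of [lo, hi]: F(hi) - F(lo) with F(x) = 1 - exp(-x).
        total += math.exp(-lo) - math.exp(-hi)
    return total / trials


estimate = expected_coverage(10)
print(estimate, 9 / 11)  # Monte Carlo estimate next to (n - 1) / (n + 1)
```

Replacing the exponential law by any other continuous distribution (with its own distribution function) should leave the estimate essentially unchanged, which is the sense in which the expected content does not depend on $\theta$.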
6. PROBLEMS FOR SOLUTION


1. A complete class of decision functions is called minimal if it does not contain a
proper subset which is complete. Show that, if a minimal complete class exists, it is
identical to the class of admissible decision functions.
2. For Example 1.1 consider the class 27 of decision functions of the form 


>> 
d(x) = d, if —<b 


=d, =b; 


where b takes any real value. Sketch the power function for a typical b; also sketch
the risk function, using the given loss function with a = 1. Find a minimax decision
procedure.
3. Let $X$ have the probability measure $P_p(1) = p$, $P_p(0) = q = 1 - p$ (binomial
distribution with $n = 1$), and consider the estimation of $p \in [0, 1]$ when the loss function
is
Wd,p)=1 if |d—p|=t 


0 if |d—p|<t. 


Plot the risk function for the decision function d,(z) which estimates p = }, } when 


* = 0, 1, respectively: 
4,0) = ł 


d(l) = }. 
Show that sup Ra,(p) = 1. Consider a randomized estimator defined symbolically by 
p 


d(x) = Y where Y is a random variable with the uniform distribution [0, 1]. This 
estimator ignores the outcome and estimates by means of a number uniformly chosen 
from [0, 1]. Plot the risk function for d;(z). Show that sup Ra,(p) = ł. Which would 
you prefer on a minimax basis?

4. If $W(y)$ is a convex function defined over $R^1$ (or any open interval), show that
$W(y)$ is continuous. Hint: consider the derivatives to the left and right.

5. Prove the last statement in Theorem 2.2. 


6. Prove Theorem 2.3.

7. Prove that the first two loss functions in formula (2.5) are strictly convex.

8. Prove that the first loss function in formula (2.7) is strictly convex.

9. Prove that any sum of convex functions is convex; that any finite sum of strictly 
convex functions is strictly convex. 

10. Let $X_1, \ldots, X_n$ be independent, and let each have the normal distribution with
mean $\mu$ and variance $\sigma^2$.

(a) If $\mu \in R^1$ and $\sigma^2 = 1$, what is the complete sufficient statistic?

(b) If $\mu = 0$ and $\sigma^2 \in\, ]0, \infty[$, what is the complete sufficient statistic?

(c) If $(\mu, \sigma^2) \in R^1 \times\, ]0, \infty[$, what is the complete sufficient statistic? (See problems
in Chapter 1.)

For each of the above cases find minimum-risk (convex-loss) unbiased estimators
for the parameters: $\mu$, $\sigma^2$, $\sigma$, $(\mu, \sigma^2)$, E(X*), EX), E(X”), E(X), E(X — 198.
11. Let $X_1, \ldots, X_n$ be independent, and let each have the binomial distribution
$\Pr(X = 1) = p$, $\Pr(X = 0) = q = 1 - p$. Show that $\sum x_i$ is a complete sufficient
statistic. Find minimum variance unbiased estimates of $p$, $p(1 - p)$.

12. Let $Y_1, \ldots, Y_n$ be independent, and let each have the Poisson distribution with
mean $m$. What is the complete sufficient statistic? Find minimum-risk (convex-loss)
unbiased estimates of $m$, $m^2$.

13. The hypergeometric distribution is defined by

$$\Pr(Y = y) = \frac{\binom{D}{y}\binom{N - D}{n - y}}{\binom{N}{n}} \qquad (0 \le y \le \min(D, n)).$$

What is the complete sufficient statistic for the class corresponding to $D = 0, 1, \ldots, n$?
Find a minimum variance unbiased estimate of $D$, $D^2$.
14. Let $X_1, \ldots, X_n$ be independent, and let each have the same absolutely continuous
distribution on $R^1$. For the class of all such distributions, what is the complete sufficient
statistic? Find minimum-variance, minimum-risk (convex-loss) unbiased estimates of
u = E(X), 0° = E(X — p)?, (u, 0°), E(X*), E(X), E(X’), E(X), E(X — 1)’. 

15. Let $X_1, \ldots, X_r$ be independent, and let each have the binomial distribution with
parameter $(n, p)$. What is a complete sufficient statistic? These values $x_1, \ldots, x_r$
could correspond to the numbers of defectives in $r$ successive lots from a production
line. A parameter of interest might be the probability $P$ that an individual lot passes
a quality inspection. If the inspection plan is to accept or reject a lot according as
$x \le 2$ or $x > 2$, an unbiased estimate of $P$ is $1/r$ (number of lots accepted). Find a
minimum-variance unbiased estimate.

16. For the measures of Problem 12, find a minimum-variance unbiased estimate
of $e^{-m}$, $e^{-m}(1 + m)$.

17. Complete the proof of Theorem 2.7 by inserting the details necessary in the last 
paragraph. 

18. For the invariance theory in Section 2.3, show that each $\bar{g}$ maps $\Omega$ onto $\Omega$;
i.e., is a one-to-one mapping.

19. For the invariance theory in Section 2.3, show that $\bar{\mathcal{G}}$ is a group homomorphic
to $\mathcal{G}$.

20. For the regression example at the end of Section 2.3, show that the suggested 
estimator has minimum risk among invariant estimators (loss function being a weighting 
of the squared errors). 

21. Prove that, if for tests in Z, there does not exist one having maximum power for 
the parameter value $\theta$ ($\theta \in \Omega - \omega$), there does exist a sequence of tests for which the
power approaches the supremum. 

22. If for a hypothesis testing problem there does not exist a minimax test, show that 
at least there exists a sequence of tests having the minimax property in the limit. 

23. Show that for any two measures $P(A)$, $P'(A)$ over $\mathcal{X}(\mathcal{A})$ there exists a dominating
measure $\mu(A)$: $P \ll \mu$, $P' \ll \mu$.

24. If $P(A) = \int_A f(x) \, d\mu(x)$, then prove that $\Pr\{f(X) = 0\} = 0$. If $X$ has the
measure $P(A)$, prove that

$$\Pr\left\{\frac{g(X)}{f(X)} = +\infty \text{ or is undefined}\right\} = 0,$$

where $g(x) \ge 0$. What are the implications for Theorem 3.1?
25. Let $X_1, \ldots, X_n$ be independent, and let each have the normal distribution
with mean 0 and variance $\sigma^2$. Find a most powerful size-$\alpha$ test for

Hypothesis: $\sigma = \sigma_0$,
Alternative: $\sigma > \sigma_0$,

and hence for

Hypothesis: $\sigma \le \sigma_0$,
Alternative: $\sigma > \sigma_0$.

Find a most powerful size-$\alpha$ test for

Hypothesis: $\sigma \ge \sigma_0$,
Alternative: $\sigma < \sigma_0$.

26. For the binomial distributions with parameters $n$, $p$, find a most powerful size-$\alpha$
test for

Hypothesis: $p = p_0$,
Alternative: $p > p_0$,

and for

Hypothesis: $p = p_0$,
Alternative: $p < p_0$.


27. Let $X_1, \ldots, X_n$ be independent, and let each have the normal distribution with
mean $\mu$ and variance $\sigma^2$. Find a most powerful size-$\alpha$ test for

Hypothesis: $\sigma = \sigma_0$, $\mu \in R^1$,
Alternative: $\sigma = \sigma_1$ ($< \sigma_0$), $\mu = \mu_1$.

For what larger alternative is the test most powerful? Is it most powerful against the

Alternative: $\sigma < \sigma_0$, $\mu \in R^1$?


28. Let $X_1, \ldots, X_m$, $Y_1, \ldots, Y_n$ be independent, and let each $X_i$ be normally
distributed with mean $\mu$ and variance $\sigma^2$ and each $Y_j$ be normally distributed with mean $\nu$
and variance $\tau^2$. If $\sigma = \tau = 1$, find a most powerful size-$\alpha$ test for

Hypothesis: $\mu = \nu$, $\mu \in R^1$,
Alternative: $\mu = \mu_1$, $\nu = \nu_1$ ($\mu_1 < \nu_1$).

For what extended alternative is the test uniformly most powerful?

29 (Continuation). If $\mu = \nu = 0$, find a most powerful size-$\alpha$ test for

Hypothesis: $\sigma = \tau \in\, ]0, \infty[$,
Alternative: $\sigma = \sigma_1$, $\tau = \tau_1$.

For what extended alternative is the test uniformly most powerful?

30 (Continuation). Find a most powerful size-$\alpha$ test for

Hypothesis: $\mu, \nu \in R^1$, $\sigma = \tau \in\, ]0, \infty[$,
Alternative: $\mu = \mu_1$, $\nu = \nu_1$, $\sigma = \sigma_1$, $\tau = \tau_1$.

For what extended alternative is the test uniformly most powerful?
31. Let $X_1, \ldots, X_n$ be independent, and let each be normally distributed with mean
$\mu$ and variance $\sigma^2$. Find a most powerful similar test of size $\alpha$ for

Hypothesis: $\sigma = \sigma_0$, $\mu \in R^1$,
Alternative: $\sigma < \sigma_0$, $\mu \in R^1$.

32. For the measures of Problem 31 find a most powerful unbiased test for

Hypothesis: $\sigma \ge \sigma_0$, $\mu \in R^1$,
Alternative: $\sigma < \sigma_0$, $\mu \in R^1$.


33. Let $X$ have the binomial distribution with parameters $n$, $p$, and let $Y$ have the
binomial distribution with parameters $m$, $p^*$. Find a most powerful similar size-$\alpha$ test for

Hypothesis: $p = p^*$,
Alternative: $p < p^*$.


34. If $P_\alpha = \{A_\alpha\}$ is an invariant partition of $\mathcal{X}$ with respect to the group $\mathcal{G}$ for each
$\alpha$ belonging to an index set $J$, show that $\{\bigcap_{\alpha \in J} A_\alpha \mid A_\alpha \in P_\alpha\}$ is a partition and is invariant.


35. Let $X_1, \ldots, X_n$ be independent and each be normally distributed with mean $\mu$
and variance $\sigma^2$. For the problem

Hypothesis: $\mu = 0$, $\sigma \in\, ]0, \infty[$,
Alternative: $\mu \ne 0$, $\sigma \in\, ]0, \infty[$,

show that $|\bar{x}| / [\sum (x_i - \bar{x})^2]^{1/2}$ or $\bar{x}^2 / \sum (x_i - \bar{x})^2$ is a maximal invariant function for the
problem in terms of the sufficient statistic. What is the maximal invariant parameter?
Find a most powerful invariant test. Show that the test is most stringent and minimax
re any invariant loss function.


36. The general linear hypothesis problem can be described as follows: $X_1, \ldots, X_n$
are independently distributed, and each $X_i$ is normally distributed with mean $\mu_i$ and
variance $\sigma^2$. $\mu_i = \sum_{j=1}^{s} a_{ij}\theta_j$ ($s < n$) and $\|a_{ij}\|$ has rank $s$.
$\Omega = \{(\sigma^2, \theta_1, \ldots, \theta_s) \mid \sigma^2 \in\, ]0, \infty[,\ \theta_j \in R^1\}$. The problem is

Hypothesis: $b_{11}\theta_1 + \cdots + b_{1s}\theta_s = c_1$,
            $b_{r1}\theta_1 + \cdots + b_{rs}\theta_s = c_r$,
Alternative: At least one inequality in the relations
             of the hypothesis.

There are $s + 1$ parameters, $\sigma^2, \theta_1, \ldots, \theta_s$, and the hypothesis imposes $r$ restrictions on
the $\theta$'s. Many of the standard problems of regression theory and the analysis of variance
are of this type.

(a) Define the orthogonal transformations and changes of origin that put this problem
into the canonical form: $Y_1, \ldots, Y_n$ are independently distributed; $Y_1, \ldots, Y_s$ are
normally distributed with means $\eta_1, \ldots, \eta_s$ and variance $\sigma^2$; and $Y_{s+1}, \ldots, Y_n$ are
normally distributed with means 0 and variance $\sigma^2$; the problem is

Hypothesis: $\eta_1 = \cdots = \eta_r = 0$,
Alternative: At least one inequality in the hypothesis.
(b) Define the group of transformations which leave the general linear hypothesis
problem invariant. Show that a maximal invariant (in terms of the Y’s) is 


or the corresponding F ratio (i.e., with the division by the degrees of freedom). 
(c) Show that the maximal invariant under the induced group on the parameter 


space is 


in terms of the parameters of the problem in canonical form. 
(d) The distribution of the noncentral $F$ (again ignoring the division by the degrees of
freedom) has the p.d.f.

$$f_{n_1, n_2, \gamma^2}(F) = \exp\left(-\frac{\gamma^2}{2}\right) \sum_{k=0}^{\infty} \frac{(\gamma^2/2)^k}{k!} \, \frac{\Gamma\left(\frac{n_1 + n_2}{2} + k\right)}{\Gamma\left(\frac{n_1}{2} + k\right) \Gamma\left(\frac{n_2}{2}\right)} \, \frac{F^{n_1/2 + k - 1}}{(1 + F)^{(n_1 + n_2)/2 + k}}.$$

It is interesting to note that this is a weighted average of a series of $F$ densities, starting
with the central $F$ and increasing the numerator degrees of freedom successively by 2
(again omitting the divisions by degrees of freedom). The weights are the Poisson
probabilities with mean $\gamma^2/2$. Show that the most powerful invariant test is the ordinary
$F$ test. Show that the $F$ test is minimax re any invariant loss function.

37. Show that the two-sided $t$ test (Problem 35) is most stringent. Show that the $F$
test (Problem 36) is most stringent.

38. For Problem 32 with the

Hypothesis: $\sigma \ge \sigma_0$, $\mu \in R^1$,
Alternative: $\sigma < \sigma_0$, $\mu \in R^1$,

find a most powerful invariant size-$\alpha$ test. Is it most stringent? Is it minimax with
respect to any invariant loss function?

39. Use Theorem 3.10 to find a most stringent test for Problem 38. 

40. Let $X_1, \ldots, X_n$ be independent and each be normally distributed with mean
$\mu$ and variance $\sigma^2$. Apply Theorem 3.10 to find a size-$\alpha$ test which maximizes the
minimum power over the alternative of the problem,

Hypothesis: $\mu = 0$,
Alternative: $\mu = \pm\mu_1$.

What is a most stringent size-$\alpha$ test for

Hypothesis: $\mu = 0$,
Alternative: $\mu \ne 0$.
41. Let $X_1, \ldots, X_n$ be independent and each be normally distributed with mean $\mu$
and variance $\sigma^2$.

(a) If $\Omega = \{(\mu, \sigma^2) \mid \mu \in R^1,\ \sigma^2 = 1\}$, find a most stringent confidence region for $\mu$.

(b) If $\Omega = \{(\mu, \sigma^2) \mid \mu \in R^1,\ \sigma^2 \in\, ]0, \infty[\}$, find a most stringent confidence region
for $\mu$.

42. For the general linear hypothesis Problem 36, find a most stringent confidence
region for a parameter, say $\eta_i$ (in canonical form); for a $\theta_j$.

43. Let $X_1, \ldots, X_n$ be independent and each be normally distributed with mean $\mu$
and variance $\sigma^2$.

(a) If $\sigma^2 = 1$ and power is obtained from the normal density with the same mean but
with $\sigma^2 = \epsilon$ ($< 1$), find a most stringent $\beta$-expectation tolerance region (the center of the
distribution is being weighted).

(b) If $\mu \in R^1$, $\sigma \in\, ]0, \infty[$ and power is obtained from the normal density with the same
mean but with variance decreased by a proportion $\epsilon$, find a most stringent $\beta$-expectation
tolerance region.


CHAPTER 3


Nonparametric Problems 


1. INTRODUCTION 


Much of the statistical theory developed in the past has been concerned 
with parametric problems. In these, the probability distribution has some 
simple functional form such as that of the normal distribution and is 
completely specified by one, two, or at most a countable number of real 
parameters. The essential feature is the finite or countable number of 
parameters, parameters in the traditional sense of real-valued parameters. 
There are some good reasons for this concentration on parametric 
problems. For many applications the normal, or some of the distributions 
derived therefrom, do resemble the theoretical distributions as indicated 
by repeated sampling. Second, and this is mainly a justification for the 
theoretician, it was for the normal distribution that much of the mathe- 
matical analysis was singularly tractable, and direct attempts to extend 
the analysis to other distribution forms led to great increases in complexity. 

More recently, much effort has been expended in trying to increase the 
field of application of statistics. This has taken place in two directions. 
The standard statistical procedures derived under the assumption of 
normal distributions have been examined under various modifications of 
the assumptions—usually that the functional form of the distribution has 
been altered in some simple manner. These investigations have been 
primarily concerned with the effect on the size of tests. We shall not be 
considering this approach, although we obtain, incidentally, some answers 
to the problems arising in this direction. The second approach has been 
to restate the standard problems in quite general terms and then look for 
adequate statistical procedures. In this case the class of probability 
distributions considered is quite large—so large, in fact, that it can no 
longer be indexed by a finite number of real parameters. This field of 
investigation has been given the title nonparametric statistics; that is,
statistics without parameters in the traditional sense of the term
parameter.
We propose the following as a rough description of nonparametric
theory: that portion of statistical inference for which the parameter space 
cannot be simply represented as a subset of a real space of a finite number of 
dimensions. Unfortunately, this would include a simple sequential 
problem involving normal distributions and a countable number of means. 
Such a problem properly belongs to the parametric theory concerning 
normal distributions. On the other hand, a problem involving a sample 
from a continuous distribution function has a parameter space representable
as a countable number of real coordinates (the values of the distribution 
function at the rationals). Such a problem we wish to call nonparametric. 
So without a clear-cut definition of nonparametric theory we emphasize 
that its purpose is the statistical treatment of the standard problems under 
quite general assumptions. 

In the remaining sections we sketch a few nonparametric formulations
for standard problems. For a first reading this may well be omitted since
the problems are introduced one by one in the later chapters. They are
collected here to amplify the discussion above and for comparison of one
problem with another.


2. SINGLE SAMPLE PROBLEMS 


The basic assumption in a single sample problem is that a set of real- (or
vector-) valued random variables forms a sample from a distribution over $R^1$
(or over $R^k$). The problem is to test some hypothesis concerning this distribution,
to estimate or form a confidence interval for some real-valued parameter, or to
construct a tolerance region. For later reference we classify some of the more
usual assumptions:


ASSUMPTION 2. $X_1, \ldots, X_n$ are independent, each has the same distribution
and either

(a) $X_i$ has the distribution $P_\theta$ over $R^1$; $\{P_\theta \mid \theta \in \Omega\}$ is the class of absolutely
continuous distributions over $R^1$ and is equivalently given by the class $\{f_\theta(x) \mid \theta \in \Omega\}$
of density functions re Lebesgue measure over $R^1$, or

(b) $X_i = (X_{i1}, \ldots, X_{ik})$ has a distribution $P_\theta$ over $R^k$; $\{P_\theta \mid \theta \in \Omega\}$ is the
class of all discrete distributions over $R^k$ (each has probability on at most a
countable number of points), or

(c) $X_i = (X_{i1}, \ldots, X_{ik})$ has a distribution $P_\theta$ over $R^k$; $\{P_\theta \mid \theta \in \Omega\}$ is the class
of absolutely continuous distributions over $R^k$ and is equivalently given by the
class $\{f_\theta(x) \mid \theta \in \Omega\}$ of density functions re Lebesgue measure over $R^k$.

Some hypothesis testing problems that come under these assumptions are 
outlined in the following subsections. 

2.1. The Problem of Fit. Historically, this is perhaps the first problem with a
nonparametric formulation. Karl Pearson as early as 1900 proposed the problem
and offered the now classical $\chi^2$ test of fit.
The problem of fit is to test whether a sample is from some particular distribution
against the alternative that it is from some other distribution in the class
$\{P_\theta \mid \theta \in \Omega\}$; this is expressed by

(2.1)  Hypothesis: $\theta = \theta_0$,
       Alternative: $\theta \in \Omega - \{\theta_0\}$.


For application, however, the statistician might be more interested in testing
whether a sample is from a distribution close to a particular distribution or from
a quite different distribution. We can make this precise by introducing the notion
of distance between two probability distributions. Let $F_\theta(x)$, $F_{\theta'}(x)$ be two
distribution functions over the real line. Then we could define a “distance”
between them as

$$d_1(F_\theta, F_{\theta'}) = \sup_x |F_\theta(x) - F_{\theta'}(x)|$$

or

$$d_2(F_\theta, F_{\theta'}) = \int (F_\theta(x) - F_{\theta'}(x))^2 \, dF_{\theta'}(x).$$

The first definition satisfies the usual axioms for a measure of distance; namely

(1) $d_1(F_\theta, F_{\theta'}) = d_1(F_{\theta'}, F_\theta)$,
(2) $d_1(F_\theta, F_{\theta'}) \ge 0$,
(3) $d_1(F_\theta, F_{\theta'}) + d_1(F_{\theta'}, F_{\theta''}) \ge d_1(F_\theta, F_{\theta''})$.

The second definition produces a directed distance which in general fails axiom
(1). However, for our purposes it is satisfactory. We can modify these definitions
by introducing a positive weight function $W(u)$ to depend on the value of $F_{\theta'}(x)$:

$$d_3(F_\theta, F_{\theta'}) = \sup_x |F_\theta(x) - F_{\theta'}(x)| \, W(F_{\theta'}(x)),$$

$$d_4(F_\theta, F_{\theta'}) = \int (F_\theta(x) - F_{\theta'}(x))^2 \, W(F_{\theta'}(x)) \, dF_{\theta'}(x).$$

Both these definitions in general produce directed distances.
Using one of these definitions to measure the distance of a distribution from
$F_{\theta_0}$,

$$d(F_\theta) = d_i(F_\theta, F_{\theta_0}),$$

we can describe the modified hypothesis testing problem by

(2.2)  Hypothesis: $\theta \in \{\theta \mid d(F_\theta) \le \delta\}$,
       Alternative: $\theta \in \{\theta \mid d(F_\theta) > \delta\}$

or more compactly by

(2.3)  Hypothesis: $d(F_\theta) \le \delta$,
       Alternative: $d(F_\theta) > \delta$.
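The two unweighted distances of this subsection can be evaluated numerically. The sketch below is an illustration only: the two distribution functions (the uniform law on $[0, 1]$ and the law with distribution function $x^2$ on $[0, 1]$), the grid, and the function names are assumptions chosen for the example. The sup distance is approximated by a maximum over the grid, and the squared-difference distance by a midpoint Stieltjes sum taken with respect to the second distribution function.

```python
def d1(F, G, grid):
    """Sup distance sup_x |F(x) - G(x)|, approximated on a grid."""
    return max(abs(F(x) - G(x)) for x in grid)


def d2(F, G, grid):
    """Integral of (F - G)^2 dG, approximated by a midpoint Stieltjes sum."""
    total = 0.0
    for a, b in zip(grid, grid[1:]):
        m = 0.5 * (a + b)
        total += (F(m) - G(m)) ** 2 * (G(b) - G(a))
    return total


N = 10000
grid = [i / N for i in range(N + 1)]


def F(x):
    return x          # uniform distribution on [0, 1]


def G(x):
    return x * x      # distribution with density 2x on [0, 1]


dist1 = d1(F, G, grid)
dist2 = d2(F, G, grid)
print(dist1, dist2)
```

For this pair the exact values are $\sup_x |x - x^2| = 1/4$ and $\int_0^1 (x - x^2)^2 \, 2x \, dx = 1/30$, which gives a check on the approximation.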
2.2. The Problem of Location. This problem is concerned with the location
of a probability distribution. To describe the problem we need the concept of a
location parameter: a real number which is calculated from a distribution and
which measures where the distribution is ‘located’. In as general a class of
distributions as given by Assumption 2a, the median is usually used. However, if the
location of one or other end of the distribution was of prime importance, some
other percentile could be used. In [2] the problem of defining a location parameter
for a general class of distributions is considered, and, under a mild restriction
on the transformation properties of the parameter, it is shown that the location
parameter must be a percentile. We designate the $p$th percentile of a distribution
$F$ by $\xi_p(F)$, and we have, under Assumption 2a, the defining equation
$F(\xi_p) = p$. A more general definition is given by (2.4) in Chapter 2.

One form of the location parameter problem is to test the hypothesis that the
location parameter has a specified value $\xi_0$ against the alternative that it has a
larger value:

(2.4)  Hypothesis: $\xi_p(F_\theta) = \xi_0$, $\theta \in \Omega$,
       Alternative: $\xi_p(F_\theta) > \xi_0$, $\theta \in \Omega$.

Sometimes a more general hypothesis is wanted: that the location parameter
takes a value less than or equal to the specified value $\xi_0$:

(2.5)  Hypothesis: $\xi_p(F_\theta) \le \xi_0$, $\theta \in \Omega$,
       Alternative: $\xi_p(F_\theta) > \xi_0$, $\theta \in \Omega$.

Such location problems are called one-sided because the alternative values of
the location parameter are all larger than $\xi_0$ (or for the analogous problem are
smaller). The two-sided location parameter problem is given by

(2.6)  Hypothesis: $\xi_p(F_\theta) = \xi_0$, $\theta \in \Omega$,
       Alternative: $\xi_p(F_\theta) \ne \xi_0$, $\theta \in \Omega$.

With the median as the location parameter there is another form of the
problem which has had frequent consideration in the literature. It differs from
those above in that there is an over-all assumption that the distributions are
symmetric about the median. The probability measure $P_\theta$ with median $\xi_{0.5}$ is
said to be symmetric if

$$P_\theta\{[\xi_{0.5} - x, \xi_{0.5}]\} = P_\theta\{[\xi_{0.5}, \xi_{0.5} + x]\}$$

for all positive $x$. In terms of the distribution function $F_\theta$ the condition is

$$F_\theta(\xi_{0.5} - x) = 1 - F_\theta(\xi_{0.5} + x - 0)$$

for all positive $x$. Then a modified form of the one-sided location problem is

(2.7)  Hypothesis: $\xi_{0.5}(F_\theta) = \xi_0$, $F_\theta$ symmetric, $\theta \in \Omega$,
       Alternative: $\xi_{0.5}(F_\theta) > \xi_0$, $F_\theta$ symmetric, $\theta \in \Omega$,

and of the two-sided problem is

(2.8)  Hypothesis: $\xi_{0.5}(F_\theta) = \xi_0$, $F_\theta$ symmetric, $\theta \in \Omega$,
       Alternative: $\xi_{0.5}(F_\theta) \ne \xi_0$, $F_\theta$ symmetric, $\theta \in \Omega$.
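To make the one-sided median problem concrete, here is a minimal sketch of the classical sign test, which is not developed in this section and is used here only as an illustration. It relies on the standard fact that, when the median of a continuous distribution equals $\xi_0$, the number of observations exceeding $\xi_0$ is binomial with parameters $n$ and $1/2$. The function name and the sample values are hypothetical.

```python
from math import comb


def sign_test_p_value(xs, xi0):
    """One-sided p-value for: median = xi0 versus median > xi0."""
    # Ties with xi0 have probability zero under continuity; drop them if present.
    kept = [x for x in xs if x != xi0]
    n = len(kept)
    k = sum(1 for x in kept if x > xi0)
    # Under the hypothesis, k is binomial(n, 1/2); sum the upper tail.
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


sample = [1.2, 0.4, 2.3, 3.1, 0.9, 1.7, 2.8, 1.1]  # hypothetical data
p_value = sign_test_p_value(sample, 1.0)
print(p_value)
```

Here six of the eight observations exceed 1.0, so the p-value is the binomial tail $(28 + 8 + 1)/256$. Large counts above $\xi_0$ point toward the alternative of a larger median.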
2.3. Location and Symmetry. This problem differs from the last above in
that the alternative is increased to include all distributions for which either the
symmetry is lost, or the location parameter is different from the hypothesis value,
or both. In short, the problem is to test whether the distribution is symmetric
about a specified value.

(2.9)  Hypothesis: $\xi_{0.5}(F_\theta) = \xi_0$, $F_\theta$ symmetric, $\theta \in \Omega$,
       Alternative: $\xi_{0.5}(F_\theta) \ne \xi_0$ or $F_\theta$ not symmetric, $\theta \in \Omega$.


2.4. Independence. The problem of independence is to test whether a
vector-valued random variable has independent components. If the components
are not independent, then there is a probability connection or association between
them. It is for this reason that a test for the problem of independence is sometimes
called a test for association. We describe the simplest form of the
problem: to test whether two real random variables are independent. Let
$F_\theta(x^{(1)}, x^{(2)})$ be a bivariate distribution function, and let $F_\theta^{(1)}(x^{(1)})$ and $F_\theta^{(2)}(x^{(2)})$
be the corresponding marginal distribution functions; then the problem is

(2.10)  Hypothesis: $F_\theta(x^{(1)}, x^{(2)}) = F_\theta^{(1)}(x^{(1)}) F_\theta^{(2)}(x^{(2)})$
                    for all $(x^{(1)}, x^{(2)})$; $\theta \in \Omega$,
        Alternative: $F_\theta(x^{(1)}, x^{(2)}) \ne F_\theta^{(1)}(x^{(1)}) F_\theta^{(2)}(x^{(2)})$
                    for some $(x^{(1)}, x^{(2)})$; $\theta \in \Omega$.


3. RANDOMNESS PROBLEMS 


In the single-sample problems considered above the over-all assumption was
that a set of $n$ random variables forms a sample from some probability distribution.
In randomness problems this assumption becomes the hypothesis and
is tested against alternatives for which the $n$ random variables are from different
distributions or have a degree of dependence. We give first some general
assumptions.

ASSUMPTION 3. $X_1, \ldots, X_n$ are independent, and either

(a) $X_i$ has the distribution $P_{\theta_i}$ over $R^1$; $\{P_\theta \mid \theta \in \Omega\}$ is the class of absolutely
continuous distributions over $R^1$ and is equivalently given by the class
$\{f_\theta(x) \mid \theta \in \Omega\}$ of densities re Lebesgue measure; $(\theta_1, \ldots, \theta_n) \in \Omega^n$, or

(b) $X_i = (X_{i1}, \ldots, X_{ik})$ has the distribution $P_{\theta_i}$ over $R^k$; $\{P_\theta \mid \theta \in \Omega\}$ is the
class of absolutely continuous distributions over $R^k$ and is equivalently given by
the class $\{f_\theta(x) \mid \theta \in \Omega\}$ of density functions; $(\theta_1, \ldots, \theta_n) \in \Omega^n$.

3.1. The Two-Sample Problem. In the two-sample problem there are two sets
of random variables, each being a sample from a probability distribution. The
problem is to test whether the distributions are the same; that is, whether the
two samples can be regarded as a single sample. Let the first $n_1$ $X_i$'s correspond
to one sample and the remaining $n_2$ $X_i$'s correspond to the second sample; then
$n = n_1 + n_2$.

(3.1)  Hypothesis: $X_i$ has measure $P_\theta$ ($i = 1, \ldots, n$); $\theta \in \Omega$,
       Alternative: $X_i$ has measure $P_{\theta_1}$ ($i = 1, \ldots, n_1$); $\theta_1 \in \Omega$,
                    $X_{n_1 + j}$ has measure $P_{\theta_2}$ ($j = 1, \ldots, n_2$); $\theta_2 \in \Omega$,
                    $\theta_1 \ne \theta_2$.

Sometimes a more restricted alternative is considered in which the distribution
for one sample has the same form as for the second sample but is shifted in
location. For example, if $k = 1$, we might have the two distribution functions
under the alternative connected by $F_1(x) = F_2(x + d)$ for all $x$ ($d \ne 0$).


3.2. c-Sample Problem. This is an extension of the previous problem to a
consideration of $c$ samples. Assuming $n = n_1 + \cdots + n_c$ we have

(3.2)  Hypothesis: $X_i$ has measure $P_\theta$ ($i = 1, \ldots, n$); $\theta \in \Omega$,
       Alternative: $X_i$ has measure $P_{\theta_1}$ ($i = 1, \ldots, n_1$); $\theta_1 \in \Omega$,
                    $X_{n_1 + j}$ has measure $P_{\theta_2}$ ($j = 1, \ldots, n_2$); $\theta_2 \in \Omega$,
                    $\theta_1, \theta_2, \ldots, \theta_c$ not all equal.

A number of more restricted alternatives have been considered. In these the
distributions for the different samples are usually assumed to be the same except
for a shift of location; these are called slippage alternatives.


3.3. The Regression Alternative. This alternative to randomness is a linear
regression model in which the ‘errors’ are independent and identically distributed
according to a distribution in the $\Omega$ defined at the beginning of this section.
More precisely, we assume that the random variables $X_i$ can be described by

$$X_i = \xi c_i + Y_i,$$

where $c_1, \ldots, c_n$ are given numbers for the experiment, and $Y_1, \ldots, Y_n$ are
independent, and each has the same distribution in the class $\Omega$ given in Assumption 3a.
$\xi$ is called the regression coefficient, the $c_i$ are values of the independent
variable, and the $Y_i$ are the errors about regression. If we let $\bar{c} = n^{-1} \sum c_i$, then
we can write

$$X_i = \xi(c_i - \bar{c}) + (Y_i + \xi\bar{c}) = \xi(c_i - \bar{c}) + Y_i',$$

and once again the $Y_i'$ are independent, and each has the same distribution in $\Omega$.
Thus, without loss of generality for our purposes here, we assume $\sum c_i = 0$.
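The centering step is an algebraic identity, and a small numerical check may help fix ideas. In the sketch below all numbers (the constants, the errors, and the regression coefficient) are hypothetical values chosen for illustration; the point is only that replacing the constants by their deviations from the mean and absorbing the mean term into the errors leaves the observations unchanged and makes the new constants sum to zero.

```python
c = [1.0, 2.0, 4.0, 7.0]      # hypothetical values of the independent variable
y = [0.3, -0.1, 0.5, -0.7]    # hypothetical errors about regression
xi = 1.5                      # hypothetical regression coefficient

c_bar = sum(c) / len(c)

# Original form: X_i = xi * c_i + Y_i.
x_original = [xi * ci + yi for ci, yi in zip(c, y)]

# Centered form: constants c_i - c_bar, errors Y_i + xi * c_bar.
c_centered = [ci - c_bar for ci in c]
y_shifted = [yi + xi * c_bar for yi in y]
x_centered = [xi * ci + yi for ci, yi in zip(c_centered, y_shifted)]

print(x_original)
print(x_centered)
print(sum(c_centered))
```

The shifted errors differ from the original ones by the same constant, so they are still independent and identically distributed, with a distribution that is again absolutely continuous.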
The randomness problem with one-sided regression alternative is given by

(3.3)  Hypothesis: $\xi = 0$, $\theta \in \Omega$,
       Alternative: $\xi > 0$, $\theta \in \Omega$,

and the two-sided problem by

(3.4)  Hypothesis: $\xi = 0$, $\theta \in \Omega$,
       Alternative: $\xi \ne 0$, $\theta \in \Omega$.


3.4. Two-Sample Scale Problem. This problem for normal distributions is
to test whether two distributions have the same variances. However, for the
general class of distributions given by Assumption 3a the variance seems entirely
unsuited to measuring the scale or spread of a distribution, particularly so because
it takes the value $+\infty$ for some quite simple distributions. There are reasonable
nonparametric scale parameters such as the difference between two specified
percentiles (see [2]), but the formulations we propose below do not need the
definition of a scale parameter.

Let $X_1, \ldots, X_{n_1}$ be independent and each have the same distribution function
$F_\theta(x)$, $\theta \in \Omega$; also let $X_{n_1 + 1}, \ldots, X_{n_1 + n_2}$ be independent and each have the same
distribution function $F_\eta(x)$, $\eta \in \Omega$. Then we have

(3.5)  Hypothesis: $F_\theta(x) = F_\eta(x + c)$ for all $x$; $\theta, \eta \in \Omega$,
       Alternative: $\xi_{p_2}(F_\theta) - \xi_{p_1}(F_\theta) < \xi_{p_2}(F_\eta) - \xi_{p_1}(F_\eta)$
                    for all $p_2 > p_1$; $\theta, \eta \in \Omega$.

A formulation with a more general alternative is given by

(3.6)  Hypothesis: $F_\theta(x) = F_\eta(x + c)$ for all $x$; $\theta, \eta \in \Omega$,
       Alternative: $\Pr\{|Y_1 - Y_2| < |Y_3 - Y_4|\} > \tfrac{1}{2}$,
                    where $Y_1, Y_2$ designate random variables with distribution $F_\theta$;
                    $Y_3, Y_4$ designate random variables with distribution $F_\eta$;
                    $\theta, \eta \in \Omega$.

In each case the alternatives are one-sided with the second distribution more
“spread out” than the first distribution.
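The probability appearing in the more general alternative, the chance that two observations from the first distribution lie closer together than two observations from the second, can be estimated by simulation. The sketch below is an illustration under assumptions not made in the text: both distributions are taken to be normal with mean zero, the two pairs are drawn independently, and the function name is hypothetical. For that special case the probability has the closed form $(2/\pi)\arctan(\sigma_2/\sigma_1)$, which serves as a check.

```python
import math
import random


def spread_probability(sd_first, sd_second, trials=20000, seed=1):
    """Estimate Pr{|Y1 - Y2| < |Y3 - Y4|} for two centered normal laws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Difference of two independent draws from the first distribution.
        d_first = rng.gauss(0.0, sd_first) - rng.gauss(0.0, sd_first)
        # Difference of two independent draws from the second distribution.
        d_second = rng.gauss(0.0, sd_second) - rng.gauss(0.0, sd_second)
        if abs(d_first) < abs(d_second):
            hits += 1
    return hits / trials


same = spread_probability(1.0, 1.0)
wider = spread_probability(1.0, 3.0)
exact_wider = 2 / math.pi * math.atan(3.0)
print(same, wider, exact_wider)
```

With equal spreads the estimate sits near one half, and it rises above one half when the second distribution is more spread out.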


4. RANDOMIZED BLOCKS AND MORE GENERAL DESIGNS 


For the more general experimental designs there are many ways in which the
assumptions can be made nonparametric. If the assumptions are too liberal,
it can happen that a parameter to be estimated or about which a hypothesis is
to be tested may be lost within the freedom of the probability distribution, that is, be
nonidentifiable. This is one of the difficulties encountered with the general linear
hypothesis. However, for the randomized block problem we pose several
formulations of which two, as we shall see later, have solutions with quite
satisfactory properties.


4.1. Randomized Blocks. The randomized-block design corresponds to the
classical agricultural experiment in which a number of treatments, say $c$, are
applied randomly to $c$ ‘plots’ in a ‘block’ of land, and then this is repeated a
number of times to make a total of, say, $b$ blocks. The first nonparametric
formulation that comes to mind is to keep all the usual assumptions but the one
concerning the distribution of the ‘error’ and add the general assumption that
errors have some absolutely continuous distribution. Precisely, let $X_{ij}$ designate
the random variable for the $i$th block and the $j$th treatment, and assume that

$$X_{ij} = \alpha + \rho_i + \tau_j + Y_{ij},$$

where the $Y_{ij}$ are independent, and each has the same absolutely continuous
distribution over $R^1$ with distribution function $F_\theta(x)$, $\theta \in \Omega$. Also, we assume
without loss of generality that $\sum \tau_j = \sum \rho_i = 0$.

(4.1)  Hypothesis: $\tau_1 = \cdots = \tau_c = 0$; $\alpha, \rho_1, \ldots, \rho_b \in R^1$; $\theta \in \Omega$,
       Alternative: Not all $\tau_j = 0$; $\alpha, \rho_1, \ldots, \rho_b \in R^1$; $\theta \in \Omega$.

We can generalize this model by allowing the errors to have different distributions
from block to block but the same within any block. Let the $Y_{ij}$ be
independent and each $Y_{ij}$ have an absolutely continuous distribution over $R^1$
with distribution function $F_{\theta_i}(x)$, $\theta_i \in \Omega$.

(4.2)  Hypothesis: $\tau_1 = \cdots = \tau_c = 0$; $\alpha, \rho_1, \ldots, \rho_b \in R^1$; $\theta_1, \ldots, \theta_b \in \Omega$,
       Alternative: Not all $\tau_j = 0$; $\alpha, \rho_1, \ldots, \rho_b \in R^1$; $\theta_1, \ldots, \theta_b \in \Omega$.


In some applications it is not unreasonable to expect the errors within a block
to be dependent. Let $(X_{11}, \ldots, X_{1c}), \ldots, (X_{b1}, \ldots, X_{bc})$ be independent, and
let $(X_{i1}, \ldots, X_{ic})$ have an absolutely continuous distribution over $R^c$ with
density function $f_{\theta_i}(x_{i1}, \ldots, x_{ic})$, $\theta_i \in \Omega$. If we assume that each block has the
same distribution, then one possible formulation is

(4.3)  Hypothesis: $f_\theta(x_{i1}, \ldots, x_{ic})$ symmetric in $x_{i1}, \ldots, x_{ic}$; $\theta \in \Omega$,
       Alternative: $f_\theta(x_{i1}, \ldots, x_{ic})$ not symmetric for all
                    $(x_{i1}, \ldots, x_{ic})$; $\theta \in \Omega$.

Since it is assumed that the treatments are randomly assigned within a block,
then equivalence of treatments implies that in each block the distribution is
symmetric with respect to the treatments; hence the hypothesis above. The
alternative above is simply that the treatments are not equivalent. A more
restricted alternative might be of more interest in many applications.
If we no longer assume the distributions identical from block to block, then
the following formulation might be appropriate.

(4.4)  Hypothesis: $f_{\theta_i}(x_{i1}, \ldots, x_{ic})$ symmetric; $\theta_1, \ldots, \theta_b \in \Omega$,
       Alternative: $f_{\theta_i}(x_{i1}, \ldots, x_{ic})$ not symmetric for all
                    $(x_{i1}, \ldots, x_{ic})$; $\theta_1, \ldots, \theta_b \in \Omega$.

This alternative is quite general. Once again a more restricted alternative may be
preferred for some problems.

A third type of formulation might be appropriate if we were little concerned
about small differences between treatments but wished to test whether in some
general sense they had an equivalent effect. Let $(X_{11}, \ldots, X_{1c}), \ldots, (X_{b1}, \ldots, X_{bc})$
be independent and each $(X_{i1}, \ldots, X_{ic})$ have a continuous distribution over $R^c$
with distribution function $F_\theta(x_1, \ldots, x_c)$, $\theta \in \Omega$ ($\Omega$ here indexes all the
continuous distributions over $R^c$). Then, if $(Y_1, \ldots, Y_c)$ designates a random
variable with distribution function $F_\theta(y_1, \ldots, y_c)$, we have

(4.5)  Hypothesis: $\Pr_\theta\{Y_{j_1} < \cdots < Y_{j_c}\} = (c!)^{-1}$; $\theta \in \Omega$,
       Alternative: Not all $\Pr_\theta\{Y_{j_1} < \cdots < Y_{j_c}\} = (c!)^{-1}$; $\theta \in \Omega$.

$(j_1, \ldots, j_c)$ represents a typical permutation of $(1, \ldots, c)$.


4.2. The General Linear Hypothesis. The general linear hypothesis problem
is to test whether a number of regression coefficients are equal to zero. Let
$(X_1, \cdots, X_n)$ be a random variable with structure

$$X_i = \sum_{j=1}^{s} \beta_j a_{ij} + \sum_{k=1}^{r} \eta_k b_{ik} + Y_i,$$

where the $a$'s and $b$'s are treated as a known set of constants in any application.
The $\beta$'s and $\eta$'s are called regression coefficients, and the $Y_i$'s are called error
terms.

For the first formulation we assume that $Y_1, \cdots, Y_n$ are independent and that
each has the same absolutely continuous distribution over $R^1$ with distribution
function $F_\theta(x)$. Let $\theta \in \Omega_1$ index those absolutely continuous distribution
functions having $F_\theta(0) = \tfrac{1}{2}$; that is, having median equal to zero.

(4.6)
Hypothesis: $\beta_1 = \cdots = \beta_s = 0$; $\eta_1, \cdots, \eta_r \in R^1$; $\theta \in \Omega_1$,
Alternative: Not all $\beta_j = 0$; $\eta_1, \cdots, \eta_r \in R^1$; $\theta \in \Omega_1$.

The second formulation admits a degree of dependence among the errors.
Let $(Y_1, \cdots, Y_n)$ have an absolutely continuous distribution over $R^n$ with
probability density function $f_\theta(y_1, \cdots, y_n)$. Also let $\theta \in \Omega_2$ index those density
functions $f_\theta(y_1, \cdots, y_n)$ which are spherically symmetric about $(0, \cdots, 0)$; that
is, for which $f_\theta$ can be written

$$f_\theta(y_1, \cdots, y_n) = p_\theta(y_1^2 + \cdots + y_n^2).$$

(4.7)
Hypothesis: $\beta_1 = \cdots = \beta_s = 0$; $\eta_1, \cdots, \eta_r \in R^1$; $\theta \in \Omega_2$,
Alternative: Not all $\beta_j = 0$; $\eta_1, \cdots, \eta_r \in R^1$; $\theta \in \Omega_2$.


5. PROBLEMS FOR SOLUTION


1. Show that $d_1(F, F')$ defined in Section 2.1 satisfies the axioms for a distance
function.

2. Which axioms do $d_2$, $d_3$, $d_4$ satisfy?

3. The two-sample problem is a particular case of the regression alternative in Section
3.3. Define the $c_i$ which give the two-sample problem.

4. The randomized-block problem is a particular case of the general linear hypothesis
problem. Define the constants $a_{ij}$, $b_{ik}$ which produce the randomized-block problems.


CHAPTER 4


The Estimation of Real Parameters
and Tolerance Regions


1. INTRODUCTION 


In parametric problems the model for an experiment often forms a unit 
by itself, there being no subspace with a distribution from which the 
over-all random variable forms a sample. In such cases, however, the
simple structure involving only a finite number of parameters usually 
permits sufficient control that parameters can be estimated and tests made 
without repetitions of the experiment. When the assumptions are 
weakened to make a problem nonparametric, there is a much greater need 
for repetitions so that the over-all random variable forms a sample from 
a distribution over a component space. Of course, there is need too for 
theory to cover the more complex problems, but little has been developed 
for this purpose. The estimation and tolerance region theory developed 
in this chapter will assume that the over-all random variable is a sample 
from a distribution over a component space. 


2. THE ESTIMATION OF REAL PARAMETERS 


Let $\mathcal{X}$ be a space with a class of probability measures $\{P_\theta \mid \theta \in \Omega\}$.
For a sample of $n$ from a distribution in this class, we have the sample
space $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, where each $\mathcal{X}_i$ is identical to $\mathcal{X}$. Also, we
have the class $\{P_\theta^n \mid \theta \in \Omega\}$, where each measure $P_\theta^n$ is the product measure
of $P_\theta$ over each component space. In the examples at the end of this
section, $\mathcal{X}$ will be the real line or a Euclidean space $R^k$ of $k$ dimensions and
the class of distributions will comprise the absolutely continuous or the
discrete distributions, or subclasses of these. (See Assumptions 2 in
Chapter 3.)

Of the different properties on which our choice of estimator can be
based, unbiasedness is most easily applied in nonparametric theory. The
classical estimator of the median of a distribution on the real line is the 
sample median. And it is possible to show that there does not exist an 
unbiased estimator for the median of an arbitrary continuous distribution 
on the real line. Nevertheless we restrict ourselves to estimation based on 
unbiasedness and in passing indicate the need for a general theory of 
median estimation. 

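A real-valued parameter $g(\theta)$ is called estimable if it has an unbiased
estimator; that is, if there exists a statistic $f(x_1, \cdots, x_n)$ such that

$$E_\theta\{f(X_1, \cdots, X_n)\} = \int_{\mathcal{X}^n} f(x_1, \cdots, x_n) \prod_{i=1}^{n} dP_\theta(x_i) = g(\theta) \tag{2.1}$$

for all $\theta \in \Omega$. An estimable parameter is sometimes called a regular
parameter.

Similarly, a vector-valued parameter $g(\theta) = (g_1(\theta), \cdots, g_s(\theta))$ is called
estimable if there exists a vector-valued statistic $f(x_1, \cdots, x_n) = (f_1(x_1, \cdots, x_n), \cdots, f_s(x_1, \cdots, x_n))$ such that

$$E_\theta\{f(X_1, \cdots, X_n)\} = \left( \int_{\mathcal{X}^n} f_1(x_1, \cdots, x_n) \prod_{i=1}^{n} dP_\theta(x_i), \cdots, \int_{\mathcal{X}^n} f_s(x_1, \cdots, x_n) \prod_{i=1}^{n} dP_\theta(x_i) \right) = g(\theta). \tag{2.2}$$

The degree $m$ of an estimable parameter is defined to be the smallest
sample size for which the parameter has an unbiased estimator; it is the
minimum value of $n$ for which there is an equation (2.1) or (2.2).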

Any unbiased estimator of a parameter based on the minimum sample
size $m$ is called a kernel. It is easily seen that there is always a symmetric
kernel. For, if $f(x_1, \cdots, x_m)$ is a kernel, then there is a symmetric
statistic $f_s(x_1, \cdots, x_m)$ defined by

$$f_s(x_1, \cdots, x_m) = \frac{1}{m!} \sum_P f(x_{i_1}, \cdots, x_{i_m}),$$

where the summation is over all permutations $(i_1, \cdots, i_m)$ of $(1, \cdots, m)$.
This statistic is an average of $m!$ forms, each of which is an unbiased
estimator of the parameter. From the properties of expectation it follows
that the symmetric function $f_s(x_1, \cdots, x_m)$ is an unbiased estimator and
hence is a kernel of the parameter.
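
The symmetrization step can be illustrated with a minimal Python sketch. The helper name `symmetrize` is hypothetical and not part of the text; the kernel used for illustration, $x_1^2 - x_1 x_2$, is the unsymmetric variance kernel that appears in Example 2.1.

```python
from itertools import permutations
from math import factorial


def symmetrize(kernel, m):
    """Return f_s, the average of `kernel` over all m! orderings of its arguments."""
    def f_s(*xs):
        if len(xs) != m:
            raise ValueError(f"expected {m} arguments, got {len(xs)}")
        # Sum the kernel over every permutation of the m arguments.
        total = sum(kernel(*p) for p in permutations(xs))
        return total / factorial(m)
    return f_s


def variance_kernel(x1, x2):
    """Unsymmetric kernel x1^2 - x1*x2 (illustrative choice taken from Example 2.1)."""
    return x1 * x1 - x1 * x2


sym_variance_kernel = symmetrize(variance_kernel, 2)
```

For this kernel the symmetrized form agrees with $(x_1 - x_2)^2/2$, and its value does not depend on the order of the arguments.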

It is interesting to note some properties of estimable parameters. If
$g_1(\theta)$, $g_2(\theta)$ are estimable parameters of degrees $m_1$, $m_2$, then the sum
$g_1(\theta) + g_2(\theta)$ and the product $g_1(\theta) g_2(\theta)$ are also estimable parameters and
have degrees, respectively, less than or equal to $m = \max(m_1, m_2)$ and
$m_1 + m_2$. For, if $f_i(x_1, \cdots, x_{m_i})$ is a kernel of $g_i(\theta)$, then

$$\int_{\mathcal{X}^m} \left[ f_1(x_1, \cdots, x_{m_1}) + f_2(x_1, \cdots, x_{m_2}) \right] \prod_{i=1}^{m} dP_\theta(x_i) = g_1(\theta) + g_2(\theta)$$

and

$$\int_{\mathcal{X}^{m_1+m_2}} f_1(x_1, \cdots, x_{m_1})\, f_2(x_{m_1+1}, \cdots, x_{m_1+m_2}) \prod_{i=1}^{m_1+m_2} dP_\theta(x_i) = g_1(\theta)\, g_2(\theta).$$

Thus we have unbiased estimators of degrees $m = \max(m_1, m_2)$ and
$m_1 + m_2$, respectively. As a more general result, it follows that any
polynomial in estimable parameters is also an estimable parameter. If
the parameters are vectors, then we interpret addition and multiplication
to be the addition and multiplication of corresponding coordinates.

Corresponding to any estimator $f(x_1, \cdots, x_m)$ of an estimable parameter
$g(\theta)$, we define a U statistic for a sample of $n$ ($n \ge m$):

$$U(x_1, \cdots, x_n) = \binom{n}{m}^{-1} \sum_C f_s(x_{i_1}, \cdots, x_{i_m}), \tag{2.3}$$

where the summation $C$ is over all $\binom{n}{m}$ combinations $(i_1, \cdots, i_m)$ of $m$
integers chosen from $(1, \cdots, n)$, and $f_s$ is the symmetrized statistic corresponding
to $f(x_1, \cdots, x_m)$. Of course, we could also write

$$U(x_1, \cdots, x_n) = \frac{(n-m)!}{n!} \sum_P f(x_{i_1}, \cdots, x_{i_m}),$$

where the summation is over all permutations $P$ of $m$ integers $(i_1, \cdots, i_m)$
chosen from $1, \cdots, n$. From this last expression it is seen that $U(x_1, \cdots, x_n)$
is the symmetrized form of $f(x_1, \cdots, x_m)$ considered as a function of
$(x_1, \cdots, x_n)$. Now, since

$$E_\theta\{f(X_1, \cdots, X_m)\} = g(\theta),$$

we obtain easily that

$$E_\theta\{U(X_1, \cdots, X_n)\} = g(\theta),$$

and therefore that the U statistic is an unbiased estimator of $g(\theta)$.
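
As a rough illustration of definition (2.3), the following Python sketch averages the symmetrized kernel over all combinations of $m$ sample points. The function name `u_statistic` and its argument order are assumptions made for this example only; the enumeration is exponential in cost and is meant for small samples.

```python
from itertools import combinations, permutations
from math import comb, factorial


def u_statistic(kernel, m, sample):
    """U statistic of (2.3): mean of the symmetrized kernel over all m-subsets."""
    n = len(sample)
    if n < m:
        raise ValueError("sample size must be at least the kernel degree m")
    total = 0.0
    for subset in combinations(sample, m):
        # Symmetrize the kernel on this subset: average over its m! orderings.
        total += sum(kernel(*p) for p in permutations(subset)) / factorial(m)
    return total / comb(n, m)


# Kernel x of degree 1 gives the sample mean; kernel x1*x2 of degree 2
# gives the average of the products x_i * x_j over pairs i != j.
mean_estimate = u_statistic(lambda x: x, 1, [1.0, 2.0, 3.0, 6.0])
mean_squared_estimate = u_statistic(lambda a, b: a * b, 2, [1.0, 2.0, 3.0])
```

Because each subset is symmetrized before averaging, the result is the same whether or not the kernel passed in is already symmetric.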


EXAMPLE 2.1. Let $\mathcal{X}$ be the real line and $\{P_\theta \mid \theta \in \Omega\}$ be the class of
absolutely continuous distributions. Now consider the three real-valued
parameters $\mu_X$, $\sigma_X^2$, $\mu_X^2$ ($\mu_X$, $\sigma_X^2$ stand for the mean and variance of the
random variable $X$).

$\mu_X$ is obviously an estimable parameter because it can be written

$$\mu_X = E_\theta\{X\}.$$

It can be estimated on the basis of a sample of one; hence it is of degree 1.
The corresponding U statistic for a sample of $n$ is $n^{-1} \sum x_i$.

$\mu_X^2$ is a simple polynomial in $\mu_X$ and hence is estimable. We have

$$\mu_X^2 = [E_\theta\{X\}]^2 = E_\theta\{X_1 X_2\};$$

therefore $\mu_X^2$ is of degree less than or equal to 2. A simple example can
be constructed to show that the degree is not 1 and hence is 2. (See
Problem 2.) It is easily seen that the corresponding U statistic for a
sample of $n$ is

$$\frac{1}{n(n-1)} \sum_{i \ne j} x_i x_j.$$


$\sigma_X^2$ can also be written as a polynomial in parameters which are
obviously estimable:

$$\sigma_X^2 = E_\theta\{[X - \mu_X]^2\} = E_\theta\{X^2\} - [E_\theta\{X\}]^2. \tag{2.4}$$

Of the two parameters in terms of which $\sigma_X^2$ has been expressed, the first is
obviously of degree 1 and the second was stated above to be of degree 2.
For a sample of 2, $x_1^2 - x_1 x_2$ is an unbiased estimate of $\sigma_X^2$:

$$E_\theta(X_1^2) - E_\theta(X_1 X_2) = \sigma_X^2.$$

There cannot be an estimate from a sample of one since by rearrangement
and noting the breakdown (2.4) we could obtain an estimate of $[E_\theta(X)]^2$
based on a sample of one, and this is contradictory to the degree of
$[E_\theta(X)]^2$ being 2. Hence $\sigma_X^2$ is a regular parameter of degree 2, and
$x_1^2 - x_1 x_2$ is a kernel. The corresponding symmetric kernel is

$$\frac{x_1^2 - x_1 x_2 + x_2^2 - x_2 x_1}{2} = \frac{(x_1 - x_2)^2}{2}.$$

For a sample of $n$ the corresponding U statistic is

$$
\begin{aligned}
U(x_1, \cdots, x_n) &= \binom{n}{2}^{-1} \sum_{i<j} \frac{(x_i - x_j)^2}{2} \\
&= \frac{1}{n(n-1)} \sum_{i<j} \left( x_i^2 - 2 x_i x_j + x_j^2 \right) \\
&= \frac{1}{n(n-1)} \left[ (n-1) \sum x_i^2 - 2 \sum_{i<j} x_i x_j \right] \\
&= \frac{1}{n(n-1)} \left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \\
&= \frac{1}{n-1} \left[ \sum x_i^2 - n \bar{x}^2 \right] \\
&= s_x^2,
\end{aligned}
$$

where $\bar{x} = n^{-1} \sum x_i$ and $s_x^2$ is the sample variance.
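
The identity between the pairwise U statistic and the sample variance can be checked numerically. The sketch below is only an illustration; the function name `variance_u_statistic` is hypothetical, and `statistics.variance` from the Python standard library is used as the reference value with divisor $n - 1$.

```python
from itertools import combinations
from statistics import variance


def variance_u_statistic(sample):
    """Average of the symmetric kernel (x_i - x_j)^2 / 2 over all pairs i < j."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    n_pairs = n * (n - 1) // 2
    return sum((a - b) ** 2 / 2 for a, b in combinations(sample, 2)) / n_pairs


data = [2.0, 4.0, 4.0, 5.0, 7.0]
u_value = variance_u_statistic(data)
s2_value = variance(data)  # usual sample variance, divisor n - 1
```

Up to floating-point rounding the two values coincide, as the algebra above requires.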
The moments $\mu'_r$ of the distribution $P_\theta$ are all estimable since

$$\mu'_r = E_\theta\{X^r\}.$$

From this equation it is seen that the moments are estimable of degree 1
and that $x^r$ is a kernel for $\mu'_r$. For a sample of $n$ the U statistic corresponding
to $x^r$ is

$$\frac{1}{n} \sum x_i^r,$$

the sample $r$th moment.

The central moments $\mu_r$ for $P_\theta$ are defined by

$$\mu_r = E_\theta\{(X - \mu_X)^r\} = E_\theta\{(X - E_\theta(X))^r\}.$$

By expanding the $r$th power and taking the expectation it is seen that $\mu_r$ is
a polynomial in the moments $\mu'_1, \cdots, \mu'_r$. Hence it is estimable.

From the polynomial relationship of the cumulants of $P_\theta$ to the moments,
it is easily obtained that they too are estimable parameters.

For the remaining results in this section we need a particular statistic
defined over $\mathcal{X}^n$ and called the order statistic. In Section 7 of Chapter 1
the order statistic was defined for the case $\mathcal{X} = R^1$; we give the general
definition now. As was mentioned in Chapter 1, statistics are used to
condense the outcome of an experiment. For instance, if $(x_1, \cdots, x_n)$ is
the outcome of an experiment, one condensation of the information in
$(x_1, \cdots, x_n)$ is to record the different values $x_1, \cdots, x_n$ but neglect to
record the order in which these values occurred in the outcome. This can
be described mathematically by means of the statistic $t(x_1, \cdots, x_n)$ called
the order statistic and defined by

$$t(x_1, \cdots, x_n) = \{x_1, \cdots, x_n\}, \tag{2.5}$$

where the braces stand for the set consisting of the $n$ points $x_1, \cdots, x_n$.
This statistic loses the information about the order of the $x$'s in the outcome
$(x_1, \cdots, x_n)$ because it records only the set of $x$'s, and knowing a set
does not carry any knowledge about ordering within the set.

This definition of order statistic is equivalent to that introduced in
Section 7 of Chapter 1. There the order statistic for $(x_1, \cdots, x_n)$ with
real-valued $x_i$ was defined to be $(x_{(1)}, \cdots, x_{(n)})$, where $x_{(1)}, \cdots, x_{(n)}$ are the
numbers $x_1, \cdots, x_n$ arranged in order of magnitude, so that $x_{(1)} \le \cdots \le x_{(n)}$.
Clearly the two definitions produce statistics which
extract the same amount of information from the outcome $(x_1, \cdots, x_n)$.

We now describe a simple property of the order statistic $t(x_1, \cdots, x_n)$.
Let $f(x_1, \cdots, x_n)$ be an arbitrary statistic over $\mathcal{X}^n$. Then, if $f(x_1, \cdots, x_n)$
is a symmetric function, it can be written as a function of the order statistic

$$f(x_1, \cdots, x_n) = h(t(x_1, \cdots, x_n)),$$

and conversely, if $f(x_1, \cdots, x_n)$ can be written as a function of the order
statistic, it is a symmetric function. The proof in both directions follows
easily by noting that each of the two conditions is equivalent to stating
that $f(x_1, \cdots, x_n)$ is constant-valued over the $n!$ points $(x_{i_1}, \cdots, x_{i_n})$
corresponding to the $n!$ permutations $(i_1, \cdots, i_n)$ of $(1, \cdots, n)$.

Now, returning to the probability structure descriptive of a sample of $n$
from a distribution in $\{P_\theta \mid \theta \in \Omega\}$, we shall prove that the order statistic
$t(x)$ is a sufficient statistic. For this it is convenient to think of the statistic
as a partition of the sample space $\mathcal{X}^n$. We let $t'(x_1, \cdots, x_n)$ designate the
set of the partition containing the outcome $(x_1, \cdots, x_n)$. The set
$t'(x_1, \cdots, x_n)$ then consists of the point $(x_1, \cdots, x_n)$ and all points
$(x_{i_1}, \cdots, x_{i_n})$ obtained by permuting the coordinates $x_1, \cdots, x_n$; $t'(x)$
contains $n!$ points (or fewer if some of the coordinates are equal).

Let $P_\theta^n$ be one of the measures over $\mathcal{X}^n$. We look for a convenient way
of describing the induced distribution of the statistic $t'(x_1, \cdots, x_n)$. A set
of values attained by $t'(x_1, \cdots, x_n)$ can be exhibited as a set in $\mathcal{X}^n$, the
union of the sets $t'(x_1, \cdots, x_n)$. It will, of course, be a set that is symmetric
with respect to the $n$ coordinates. The induced probability measure of a
set of values of $t'(x_1, \cdots, x_n)$ is the $P_\theta^n$ measure of this symmetric set in
$\mathcal{X}^n$. Hence, the induced measure is given directly in terms of the measure
over $\mathcal{X}^n$, and integration with respect to it can be done over $\mathcal{X}^n$.

Let $A \subset \mathcal{X}^n$ and $P(A \mid \{x_1, \cdots, x_n\})$ stand for the conditional probability
of falling in the set $A$, given the order statistic $\{x_1, \cdots, x_n\}$. We shall
show that

$$P(A \mid \{x_1, \cdots, x_n\}) = \frac{i(A, \{x_1, \cdots, x_n\})}{n!} \tag{2.6}$$

is a determination of the conditional probability; $i(A, \{x_1, \cdots, x_n\})$ is
defined to be the number of the $n!$ permutations $(x_{i_1}, \cdots, x_{i_n})$ that fall in
$A$. Let $B$ be a symmetric set in $\mathcal{X}^n$ standing for a set of values of the
statistic $t'(x_1, \cdots, x_n)$. Examining the definition of conditional probability
in Section 4 of Chapter 1, we need only show that the following equation
holds for all $B$:

$$P_\theta^n(A \cap B) = \int_B \frac{i(A, \{x_1, \cdots, x_n\})}{n!} \prod_{i=1}^{n} dP_\theta(x_i). \tag{2.7}$$

The left-hand side of this equation can be written

$$\int_B \phi_A(x_1, \cdots, x_n) \prod_{i=1}^{n} dP_\theta(x_i),$$

where $\phi_A(x_1, \cdots, x_n)$ is the characteristic function of the set $A$ and is
defined by

$$\phi_A(x_1, \cdots, x_n) = \begin{cases} 1 & \text{if } (x_1, \cdots, x_n) \in A, \\ 0 & \text{if } (x_1, \cdots, x_n) \notin A. \end{cases}$$

Since the probability measure over $\mathcal{X}^n$ is symmetric in the coordinates and
so also is the set $B$, we have that the left-hand side is equal to

$$\int_B \phi_A(x_{i_1}, \cdots, x_{i_n}) \prod_{i=1}^{n} dP_\theta(x_i),$$

where $(i_1, \cdots, i_n)$ is any permutation of $(1, \cdots, n)$. Therefore, we can
write the left-hand side as the average of the $n!$ equal expressions obtained
by taking the $n!$ permutations of $(1, \cdots, n)$:

$$\int_B \frac{\sum_P \phi_A(x_{i_1}, \cdots, x_{i_n})}{n!} \prod_{i=1}^{n} dP_\theta(x_i).$$

It is easy to see that $\sum_P \phi_A(x_{i_1}, \cdots, x_{i_n})$ is just the number of permutations
of $(x_1, \cdots, x_n)$ that fall in $A$; that is, is equal to $i(A, \{x_1, \cdots, x_n\})$. This
establishes the equality (2.7) above.

Examining the expression for the conditional probability $P(A \mid \{x_1, \cdots, x_n\})$,
we see that it does not depend on $\theta$. Therefore, $t'(x) = \{x_1, \cdots, x_n\}$
is a sufficient statistic for the class of measures $\{P_\theta^n \mid \theta \in \Omega\}$
over $\mathcal{X}^n = \prod_{i=1}^{n} \mathcal{X}_i$ defined at the beginning of this section.
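
Formula (2.6) is a pure counting rule, which the following Python sketch makes concrete: given the observed values, it counts how many of the $n!$ orderings fall in a set $A$. The function name `conditional_probability` and the representation of $A$ as a predicate on tuples are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def conditional_probability(in_A, values):
    """P(A | order statistic) = i(A, {x_1..x_n}) / n!, as in formula (2.6)."""
    n = len(values)
    # i(A, {x}) counts the orderings of the observed values that land in A.
    count = sum(1 for p in permutations(values) if in_A(p))
    return Fraction(count, factorial(n))


observed = (5.0, 2.0, 9.0)
# A = outcomes whose first coordinate is smaller than the second.
p_first_less = conditional_probability(lambda x: x[0] < x[1], observed)
# A = outcomes whose first coordinate is the largest of the three.
p_first_largest = conditional_probability(lambda x: x[0] == max(x), observed)
```

Neither computation refers to the underlying distribution $P_\theta$, which is the content of the sufficiency argument.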

We now have a theorem which in part shows the importance of U
statistics in nonparametric estimation.


THEOREM 2.1. If $f(x_1, \cdots, x_n)$ is an unbiased estimator over $\mathcal{X}^n$ of
the parameter $g(\theta)$ re the measures $\{P_\theta^n \mid \theta \in \Omega\}$ (the power-product
measures of a sample of $n$), then the corresponding U statistic is also an
unbiased estimator of $g(\theta)$ and $\sigma_U^2(\theta) \le \sigma_f^2(\theta)$, with equality only if
$f(x_1, \cdots, x_n)$ is equal to the U statistic almost everywhere $P_\theta^n$; and, using a
strictly convex loss function, $R_U(\theta) \le R_f(\theta)$, with equality only if
$f(x_1, \cdots, x_n)$ is equal to the U statistic almost everywhere $P_\theta^n$.


Note. If the variance or risk of $f$ is $+\infty$, the variance or risk of the U
statistic may also be unbounded without the equality almost everywhere
of $f$ and $U$.


Proof. The proof follows easily from the Rao-Blackwell Theorem 2.6
in Chapter 2. We have proved that the order statistic $t(x_1, \cdots, x_n)$ is
a sufficient statistic for our class of measures. To be able to use the
Rao-Blackwell theorem and complete the proof, we need only show that
the U statistic is the conditional expectation of $f(x_1, \cdots, x_n)$, given the
order statistic $\{x_1, \cdots, x_n\}$. Almost everywhere $P_\theta^n$ we have

$$E\{f(X_1, \cdots, X_n) \mid \{x_1, \cdots, x_n\}\} = \frac{1}{n!} \sum_P f(x_{i_1}, \cdots, x_{i_n}) = U(x_1, \cdots, x_n),$$

where $U(x_1, \cdots, x_n)$ is the U statistic corresponding to $f(x_1, \cdots, x_n)$.


The next theorem adds importance to the U statistic by showing that
under suitable conditions the U statistic for an estimable parameter is
essentially unique.


THEOREM 2.2. If the order statistic $t(x) = \{x_1, \cdots, x_n\}$ is complete re
the class $P_\theta^n$ over $\mathcal{X}^n$ (a sample of $n$ from $\{P_\theta \mid \theta \in \Omega\}$), then the U statistic
corresponding to any estimable parameter is essentially unique, and it has
uniformly minimum variance and risk (convex loss) among the unbiased
estimators of the parameter. For vector estimation minimum variance is
replaced by minimum concentration ellipsoid.




Note. Theorems 7.1, 7.2, 7.3 give sufficient conditions on $\{P_\theta \mid \theta \in \Omega\}$
for the order statistic to be complete.


Note. We can permit an estimable parameter $g(\theta)$ to take the values
$+\infty$, $-\infty$ and even not exist, provided that for a statistic unbiased for
$g(\theta)$ the integral

$$g(\theta) = \int_{\mathcal{X}^n} f(x_1, \cdots, x_n) \prod_{i=1}^{n} dP_\theta(x_i),$$

respectively, converges to $+\infty$, converges to $-\infty$, and diverges. In such
a case, we add the requirement that $t(x_1, \cdots, x_n)$ be complete with respect
to the subclass of distributions corresponding to $\Omega_0 = \{\theta \mid g(\theta) \in R^1\}$; $\Omega_0$
consists of those $\theta$ for which $g(\theta)$ is finite.

Proof. We proved above that the order statistic $t(x_1, \cdots, x_n)$ was a
sufficient statistic. The theorem then follows immediately from the
Lehmann-Scheffé Theorem 2.8 in Chapter 2.


In Chapter 6 we shall prove a theorem concerning the distribution of a
U statistic as the sample size $n$ approaches infinity. If $f(x_1, \cdots, x_m)$ is a
statistic unbiased for the parameter $g(\theta)$ and $U_n(x_1, \cdots, x_n)$ is the corresponding
U statistic for a sample of $n$, then this theorem says that, as
$n \to \infty$, the distribution of $n^{1/2}(U_n - g(\theta))$ approaches the normal distribution
having mean zero and finite variance, provided only that the second
moment of $f(X_1, \cdots, X_m)$ exists. As a consequence of this theorem
we have


THEOREM 2.3. If $f(x_1, \cdots, x_m)$ is an unbiased estimate of $g(\theta)$ and if
$E_\theta\{f^2(X_1, \cdots, X_m)\}$ exists, then the corresponding U statistic for a sample
of $n$ is a consistent estimate of $g(\theta)$ as $n \to \infty$.


Proof. By the theorem mentioned above, $n^{1/2}(U_n - g(\theta))$ has a limiting
normal distribution with mean zero and finite variance. We have

$$\Pr_\theta\{g(\theta) - \varepsilon < U_n < g(\theta) + \varepsilon\} = \Pr_\theta\{-n^{1/2}\varepsilon < n^{1/2}(U_n - g(\theta)) < +n^{1/2}\varepsilon\},$$

and, since $n^{1/2}\varepsilon \to \infty$ as $n \to \infty$, this last probability expression approaches
the value 1. Thus, as $n \to \infty$, the probability that $U_n$ is in any small
neighborhood of $g(\theta)$ approaches 1, and we say that $U_n$ approaches $g(\theta)$
in probability and write $\operatorname{p-lim}_{n \to \infty} U_n = g(\theta)$.

EXAMPLE 2.2. We consider further the parameters introduced in
Example 2.1. There the sample space was $R^n$, and the class of probability
distributions were those of a sample of $n$ from an absolutely continuous
distribution on $R^1$. Theorem 7.1 in Chapter 1 proved the order statistic
complete corresponding to the uniform distributions over intervals on $R^1$.
Using Theorem 6.2 in Chapter 1, we obtain the completeness for the
larger class of absolutely continuous distributions. The conditions for
our Theorem 2.2 are fulfilled.

The parameters we considered in Example 2.1 do not exist over the 
whole parameter space. However, they do exist for each of the uniform 
distributions over intervals. Hence, the additional requirement in our 
first note after the theorem is fulfilled. 

For the parameter $E_\theta(X)$, it follows that the U statistic $n^{-1} \sum x_i$ is the
unique U statistic which is an unbiased estimator of $E_\theta(X)$. It has
minimum variance and risk (convex loss) among the unbiased estimators of
$E_\theta(X)$.

For the parameter $[E_\theta(X)]^2$ we obtained a kernel $x_1 x_2$ for a sample of 2.
The corresponding U statistic is

$$\frac{1}{n(n-1)} \sum_{i \ne j} x_i x_j.$$

By our theorem it is the only unbiased estimator that is symmetric in the
$x$'s, and it has minimum variance among the unbiased estimators.

For the parameter $\sigma_X^2$ we obtained the symmetric kernel $\tfrac{1}{2}(x_1 - x_2)^2$.
The corresponding U statistic is

$$\frac{1}{n-1} \sum (x_i - \bar{x})^2.$$

By the theorem it is the only unbiased estimator that is symmetric in
the $x$'s, and it has minimum variance among the unbiased estimators
of $\sigma_X^2$.
For the unbiased estimation of the cumulants R. A. Fisher proposed the 
k statistics. They are unbiased estimators, and, because they are 
symmetric, they are functions of the order statistic. From Theorem 2.2 


it follows that they are the minimum-variance and risk (convex-loss) 
estimators of the cumulants. 


As a measure of the concentration or of the spread of a distribution, the
parameter $\Delta_\theta$ has been used:

$$\Delta_\theta = E_\theta\{|X_1 - X_2|\} = \iint |x_1 - x_2| \, dF_\theta(x_1) \, dF_\theta(x_2),$$

where $F_\theta$ is the distribution function for the parameter value $\theta$. Obviously
the parameter is estimable and has a kernel $|x_1 - x_2|$. The corresponding
U statistic is

$$d = \frac{1}{n(n-1)} \sum_{i \ne j} |x_i - x_j|$$

and is called Gini's mean difference [13]. It has minimum variance among
unbiased estimators of $\Delta_\theta$.
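
A small Python sketch of Gini's mean difference follows. It averages $|x_i - x_j|$ over unordered pairs $i < j$, which equals the average over ordered pairs $i \ne j$ because the kernel is symmetric. The function name `gini_mean_difference` is a label chosen for this illustration.

```python
from itertools import combinations


def gini_mean_difference(sample):
    """U statistic with kernel |x1 - x2|: mean absolute difference over all pairs."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    n_pairs = n * (n - 1) / 2
    return sum(abs(a - b) for a, b in combinations(sample, 2)) / n_pairs


# Pairwise absolute differences for (1, 2, 4) are 1, 3 and 2.
d_example = gini_mean_difference([1.0, 2.0, 4.0])
```

Like every U statistic, the value is unchanged when the observations are reordered.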

The probability measure of a set is also an estimable parameter;
consider $P_\theta(A)$, where $A$ is a set on the real line. Let $\phi_A(x)$ be the
characteristic function for $A$:

$$\phi_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$

Then

$$E_\theta\{\phi_A(X)\} = P_\theta(A),$$

and the parameter is estimable of degree 1. The minimum-variance
unbiased estimator for a sample of $n$ is $m/n$, where $m$ is the number of $x$'s
in the set $A$.


EXAMPLE 2.3. For a sample of $n$ from an arbitrary discrete distribution
on the real line, we have from Theorem 7.3 in Chapter 1 that the order
statistic is complete. We can then apply our Theorem 2.2, and all the
results in the example above apply also to the discrete distributions on
the real line.


EXAMPLE 2.4. Let $X = (X_1, \cdots, X_{n_1})$, where the $X_i$ are independent
and each has the same distribution $P_\theta$, where $\{P_\theta \mid \theta \in \Omega\}$ are the absolutely
continuous distributions on $R^1$. Similarly, let $Y = (Y_1, \cdots, Y_{n_2})$, where
the $Y_j$ are independent and each has the same distribution $P_\eta$, where
$\{P_\eta \mid \eta \in \Omega\}$ are the absolutely continuous distributions. We consider the
combined experiment with random variable $(X, Y)$ and $(\theta, \eta) \in \Omega^2$.

For the $X$'s we know that $t_1(x_1, \cdots, x_{n_1}) = \{x_1, \cdots, x_{n_1}\}$ is a sufficient and
complete statistic. Similarly $t_2(y_1, \cdots, y_{n_2}) = \{y_1, \cdots, y_{n_2}\}$ is a complete
sufficient statistic for the $Y$'s. By Theorem 5.3 in Chapter 1 the combined
statistic $(\{x_1, \cdots, x_{n_1}\}, \{y_1, \cdots, y_{n_2}\})$ is sufficient for the combined experiment.
Theorem 6.3 in Chapter 1 is not easily applied to show the
combined statistic complete. However, Theorem 7.1 of Chapter 1 can
be extended in a straightforward manner and gives this result.

Then directly from the Lehmann-Scheffé Theorem 2.8 in Chapter 2 it
follows that, for any estimable parameter, there is essentially only one
unbiased estimate based on $(\{x_1, \cdots, x_{n_1}\}, \{y_1, \cdots, y_{n_2}\})$. It is interesting
to note that a function of this statistic is symmetric in the $x$'s and symmetric
in the $y$'s, and conversely.

Consider the estimation of $E(XY)$. Obviously, an unbiased estimator
is $x_1 y_1$. By averaging we make $x_1 y_1$ symmetric in the $x$'s and obtain

$$\frac{1}{n_1} \sum_{i=1}^{n_1} x_i y_1;$$

similarly, making it symmetric in the $y$'s, we obtain

$$\frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} x_i y_j.$$

This is the only unbiased estimator symmetric in the $x$'s and in the $y$'s, and
it has minimum variance.

By a similar argument we can show that the minimum variance unbiased
estimator of $\sigma_X^2 \sigma_Y^2$ is

$$\frac{\sum_{i=1}^{n_1} (x_i - \bar{x})^2}{n_1 - 1} \cdot \frac{\sum_{j=1}^{n_2} (y_j - \bar{y})^2}{n_2 - 1}.$$


EXAMPLE 2.5. Let $(X_1, Y_1), \cdots, (X_n, Y_n)$ be independent and each
$(X_i, Y_i)$ have the same distribution $P_\theta$ over $R^2$, where $\{P_\theta \mid \theta \in \Omega\}$ is the
class of absolutely continuous distributions. The order statistic is
$\{(x_1, y_1), \cdots, (x_n, y_n)\}$ and is of course a sufficient statistic. Any function
of the order statistic is a symmetric function of the $n$ entries $(x_1, y_1), \cdots, (x_n, y_n)$.
By Theorem 7.3 in Chapter 1, it is complete.

For the estimation of the parameter $E(XY)$ we have an unbiased estimate
$x_1 y_1$. The corresponding U statistic is

$$\frac{1}{n} \sum_{i=1}^{n} x_i y_i,$$

and it is essentially unique and has minimum variance and risk (convex
loss).

For the estimation of $\sigma_{X+Y}^2$ we have by Example 2.2 the unbiased
estimator

$$\tfrac{1}{2} [x_1 + y_1 - (x_2 + y_2)]^2.$$

The corresponding U statistic is

$$\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i + y_i - \overline{x + y} \right)^2,$$

where $\overline{x + y} = n^{-1} \sum (x_i + y_i)$, and it has minimum variance among
unbiased estimators.


3. TOLERANCE REGIONS


Again let the sample space be $\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, and let the class
of probability measures $\{P_\theta^n \mid \theta \in \Omega\}$ correspond to a sample of $n$ from a
measure in $\{P_\theta \mid \theta \in \Omega\}$ over the component space $\mathcal{X}$. In Section 5
of Chapter 2 we introduced the concept of a statistical tolerance region
and gave several definitions. Each of these definitions has applications,
but the one of particular importance is Definition 5.2 for a distribution-free
tolerance region. Such a region has regularity properties which make
it particularly attractive in nonparametric theory. Also distribution-free
tolerance regions give quite general examples of the other types of region.

For convenience we repeat the definition:
$S(x_1, \cdots, x_n)$ is a distribution-free tolerance region for $\{P_\theta \mid \theta \in \Omega\}$
over $\mathcal{X}(\mathcal{A})$ if $S(x_1, \cdots, x_n)$ takes values in $\mathcal{A}$ and if the induced
distribution of the function

$$P_\theta(S(x_1, \cdots, x_n))$$

corresponding to the measure $P_\theta^n$ over $\mathcal{X}^n$ is independent of the parameter
$\theta \in \Omega$.

Later in this section we shall have a theorem which gives a method for
constructing distribution-free tolerance regions, but first we develop some
necessary distribution theory.

Let $f(x)$ be a real-valued statistic defined over the component space $\mathcal{X}$.
For an outcome $x = (x_1, \cdots, x_n)$ from $\mathcal{X}^n$ we can calculate the value of the
statistic for each coordinate and obtain $n$ real numbers: $f(x_1), \cdots, f(x_n)$.
The first theorem is concerned with the $r$th largest of these $n$ numbers, and
we need a symbol to designate it. Let $\max(r)_{i=1}^{n}\, t_i$ designate the $r$th largest
of the $n$ real numbers $t_1, \cdots, t_n$. The theorem is concerned with the
conditional distribution over $\mathcal{X}^n$, given the real-valued statistic $\max(r)_{i=1}^{n} f(x_i)$;
that is, loosely, the conditional distribution of a sample given the value for
the $r$th largest $f(x_i)$ from the sample.

THEOREM 3.1. If the distribution of the real statistic $f(x)$ induced from
$P_\theta$ over $\mathcal{X}$ is continuous, then, for a sample of $n$, $X = (X_1, \cdots, X_n)$, from
$P_\theta$, the conditional distribution, given $\max(r)_{i=1}^{n} f(x_i) = t$, is that of a sample
of $r - 1$ from $P_\theta$ restricted to $\{x \mid f(x) > t\}$, an independent sample of
$n - r$ from $P_\theta$ restricted to $\{x \mid f(x) < t\}$, and an independent sample of one
from $P_\theta(A \mid f(X) = t)$.


Note. For an extension of this theorem to cover the case where the
induced distribution of $f(x)$ is discontinuous see [12].


Proof. When the $n$ $x$'s in the outcome are ordered by the real statistic
$f(x)$, there are $n!$ possible orderings. Each of these orderings has the
same conditional probability $1/n!$, given the order statistic $t(x_1, \cdots, x_n) =
\{x_1, \cdots, x_n\}$. This obtains immediately from formula (2.6) in the previous
section. The statement of the theorem omits to mention this "equally
likely" conditional distribution, given the order statistic, and just gives the
distribution of the $r - 1$ $x$'s having $f(x) > t$, the $n - r$ $x$'s having $f(x) < t$,
and one $x$ having $f(x) = t$. Actually, there are $n!/[(r-1)!\,1!\,(n-r)!]$ such
arrangements of the original outcome into this partition, and each has the
same probability.

We have assumed that the induced distribution of the statistic $f(x)$ is
continuous. Hence, for a sample of $n$ from $P_\theta$ over $\mathcal{X}$, the probability is
zero that any of the values $f(x_1), \cdots, f(x_n)$ are equal. We therefore omit
from further consideration the outcomes $(x_1, \cdots, x_n)$ for which any of
$f(x_1), \cdots, f(x_n)$ are equal.

From the symmetry of the probability measure over $\mathcal{X}^n$, each of
the $n!/[(r-1)!\,1!\,(n-r)!]$ cases with $r - 1$, $1$, $n - r$ $x$'s having $f(x)$,
respectively, $>$, $=$, $< t$ will have the same probability distribution. We
therefore consider one of these, say the one for which $f(x_1), \cdots, f(x_{r-1}) > t$,
$f(x_r) = t$, and $f(x_{r+1}), \cdots, f(x_n) < t$. To find the conditional distribution,
given this condition, we need three simple properties of conditional
probabilities. First, if a condition has positive probability of being
fulfilled, then the conditional measure is the given measure normalized to
the region for which the condition is fulfilled [see first definition of
conditional probability (4.6) in Chapter 1]; for example, if $P_X(A)$ is the
measure over $\mathcal{X}$, then by this measure normalized for the condition
$\{x \mid f(x) > t\}$ we mean

$$P_X(A \mid f(x) > t) = \frac{P_X[A \cap \{x \mid f(x) > t\}]}{P_X[\{x \mid f(x) > t\}]}. \tag{3.1}$$

Second, a conditional probability measure can be obtained in steps. For,
if $P_X(A \mid C_1 C_2)$ is the conditional measure from $P_X(A)$, given conditions
$C_1$ and $C_2$, it is equal to $P_{X \mid C_1}(A \mid C_2)$, the conditional measure from
$P_X(A \mid C_1)$, given condition $C_2$. This holds for both definitions of conditional
probability, provided the conditional probability is a measure
(see Problem 7). Third, if $Y$ and $Z$ are independent random variables and
the conditional measure of $Y$ is wanted, given a condition on $Z$, it is just
the given unconditioned measure for $Y$. This is immediately obtained
from the fact that the measure over $\mathcal{Y} \times \mathcal{Z}$ is a product measure.

We now apply the three properties to finding our conditional probability.
We first impose the single condition $f(x_r) = t$ on the product measure
over $\mathcal{X}^n$. The conditional measure is that of a sample of $n - 1$
from $P_\theta$ over $\mathcal{X}$, giving the coordinates other than $x_r$, and an independent
sample of one from $P_\theta(A \mid f(x) = t)$ for the coordinate $x_r$. We now impose
the additional conditions $f(x_1), \cdots, f(x_{r-1}) > t$ and $f(x_{r+1}), \cdots, f(x_n) < t$.
The resultant conditional measure is that of a sample of $r - 1$ from $P_\theta$
restricted to $\{x \mid f(x) > t\}$, giving the first $r - 1$ coordinates, an independent
sample of one from $P_\theta(A \mid f(x) = t)$, and an independent sample of $n - r$
from $P_\theta(A)$ restricted to $\{x \mid f(x) < t\}$. This proves the theorem.


We now consider some probability distributions derived from sampling
from the uniform distribution on the interval $[0, 1]$. Let $U$ designate a
real random variable with the uniform distribution on $[0, 1]$. It will have
the density function $f(u)$ defined by

$$f(u) = \begin{cases} 1 & \text{if } u \in\, ]0, 1[, \\ 0 & \text{if } u \notin\, ]0, 1[. \end{cases} \tag{3.2}$$

For a sample of $n$ we have the random variable $(U_1, \cdots, U_n)$, where the
$U$'s are independent and each has the above distribution. $(U_1, \cdots, U_n)$
has the uniform distribution over $[0, 1]^n$.

The $n$-dimensional cube $]0, 1[^n$ can be partitioned into $n!$ regions, a
typical one being $\{(u_1, \cdots, u_n) \mid 0 < u_{i_1} < \cdots < u_{i_n} < 1\}$, and a region in
which at least two of the coordinates $u_1, \cdots, u_n$ are equal. This last set
has Lebesgue measure zero and hence probability zero. Letting $u_{(1)}, \cdots,
u_{(n)}$ designate the smallest, $\cdots$, the largest of the numbers $u_1, \cdots, u_n$ and
using these as coordinates in each of the $n!$ regions, we find the $n!$ regions
to be the same; viz., $\{(u_{(1)}, \cdots, u_{(n)}) \mid 0 < u_{(1)} < \cdots < u_{(n)} < 1\}$. Also
the density function has the same value over each. It follows then that
the induced distribution of $(U_{(1)}, \cdots, U_{(n)})$ has density function $h(u_1, \cdots,
u_n)$ given by

$$h(u_1, \cdots, u_n) = \begin{cases} n! & \text{if } 0 < u_1 < \cdots < u_n < 1, \\ 0 & \text{otherwise;} \end{cases} \tag{3.3}$$

it is the uniform distribution over the region given by $0 < u_1 < \cdots < u_n < 1$.

We define new variables $c_1, \cdots, c_{n+1}$ by the equations

$$
\begin{aligned}
c_1 &= u_{(1)}, \\
c_2 &= u_{(2)} - u_{(1)}, \\
&\;\;\vdots \\
c_n &= u_{(n)} - u_{(n-1)}, \\
c_{n+1} &= 1 - u_{(n)}.
\end{aligned}
$$

These are the successive differences between the values $0, u_{(1)}, \cdots, u_{(n)}, 1$,
and they obviously satisfy the relation $\sum_{1}^{n+1} c_i = 1$. The $c$'s are sometimes
called coverages; for example, the value of $c_i = u_{(i)} - u_{(i-1)}$ is just the
probability measure (or coverage) of the interval $[u_{(i-1)}, u_{(i)}]$ as given by the
uniform distribution.

We now find the induced distribution of $(c_1, \dots, c_{n+1})$. If we choose
any $n$ of the $c$'s and consider the transformation from the $u_{(i)}$'s, the
Jacobian will have absolute value one. For consider the first $n$ $c$'s; we
have the Jacobian

$$\frac{\partial(c_1, \dots, c_n)}{\partial(u_{(1)}, \dots, u_{(n)})} =
\begin{vmatrix}
1 & -1 & 0 & \cdots \\
0 & 1 & -1 & \cdots \\
0 & 0 & 1 & \cdots \\
\vdots & & & \ddots
\end{vmatrix} = 1,$$

and similarly for any other $n$ $c$'s. Thus, the distribution of $(u_{(1)}, \dots, u_{(n)})$
as given by

$$P(A) = \int_A n!\; du_{(1)} \cdots du_{(n)}$$

with $A \subset \{(u_1, \dots, u_n) \mid 0 < u_1 < \cdots < u_n < 1\}$ yields for $c_1, \dots, c_n$ the
induced distribution

$$P(B) = \int_B n!\; dc_1 \cdots dc_n, \tag{3.4}$$

where $B \subset \{(c_1, \dots, c_n) \mid 0 < c_i < 1,\ \sum c_i < 1\}$. This distribution of the
$c$'s is uniform and has complete symmetry with respect to the $n$ $c$'s involved.
The same is true for any other set of $n$ $c$'s. It follows then that the
distribution of the $n + 1$ $c$'s is uniform over the region

$$\Bigl\{(c_1, \dots, c_{n+1}) \Bigm| 0 < c_i,\ \sum_{i=1}^{n+1} c_i = 1\Bigr\}.$$
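
The coverages can be examined numerically. The following is a minimal sketch, not part of the original development: it assumes Python's standard pseudo-random generator as the source of uniform samples, and the function names `coverages` and `mean_coverages` are hypothetical. It forms the $n + 1$ successive differences for repeated samples and averages each one; by the symmetry just noted the averages should be nearly equal.

```python
import random


def coverages(n, rng):
    """Return the n + 1 coverages c_1, ..., c_{n+1} for one uniform sample of n."""
    ordered = sorted(rng.random() for _ in range(n))
    points = [0.0] + ordered + [1.0]
    # Successive differences between 0, u_(1), ..., u_(n), 1.
    return [points[i + 1] - points[i] for i in range(n + 1)]


def mean_coverages(n, trials, seed=0):
    """Average each of the n + 1 coverages over repeated samples."""
    rng = random.Random(seed)
    totals = [0.0] * (n + 1)
    for _ in range(trials):
        for i, c in enumerate(coverages(n, rng)):
            totals[i] += c
    return [t / trials for t in totals]


if __name__ == "__main__":
    # Illustrative choice: n = 4 gives five coverages.
    print(mean_coverages(4, trials=20000))
```

Since the coverages add to one and are treated symmetrically, each average should be close to the same fraction of the unit interval.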


We now consider the induced distribution of the sum of $r$ $c$'s. Let such
a sum be designated by $C_r$. Because of the symmetry of the distribution
of the $c$'s, the distribution of $C_r$ is the same as the distribution of the sum
of the first $r$ $c$'s, $\sum_{i=1}^{r} c_i$, and from the definition of the $c$'s we have

$$C_r = \sum_{i=1}^{r} c_i = u_{(r)}.$$

The distribution function of $u_{(r)}$ is given by

$$\begin{aligned}
\Pr\{U_{(r)} \le y\} &= n! \int_{u_{(r)} \le y} du_{(1)} \cdots du_{(n)} \\
&= n! \int_0^y du_r \int_0^{u_r} du_{r-1} \cdots \int_0^{u_2} du_1 \int_{u_r}^1 du_{r+1} \cdots \int_{u_{n-1}}^1 du_n \\
&= n! \int_0^y \frac{u_r^{\,r-1}}{(r-1)!}\, \frac{(1 - u_r)^{n-r}}{(n-r)!}\; du_r \\
&= \frac{\Gamma(n+1)}{\Gamma(r)\,\Gamma(n-r+1)} \int_0^y u^{r-1} (1 - u)^{n-r}\, du.
\end{aligned} \tag{3.5}$$

Thus $C_r$ has the $\beta$ distribution with parameters $p = r$ and $q = n - r + 1$;
the $\beta$ distribution function is

$$I_y(p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\,\Gamma(q)} \int_0^y u^{p-1} (1 - u)^{q-1}\, du,$$

the so-called incomplete $\beta$ function. The mean value of $C_r$ is easily
obtained from the distribution function (3.5) or, more immediately, from
the symmetry of the distribution of the $c$'s. We have

$$1 = E\Bigl(\sum_{i=1}^{n+1} c_i\Bigr) = (n + 1)\, E(c_i),$$

from which we obtain $E(c_i) = (n + 1)^{-1}$; therefore

$$E(C_r) = r \cdot E(c_i) = \frac{r}{n+1}. \tag{3.6}$$
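
The distribution function (3.5) can be checked against an elementary count. The sketch below is illustrative only and its function names are hypothetical. It rests on the standard observation that $U_{(r)} \le y$ exactly when at least $r$ of the $n$ observations fall in $[0, y]$, so the incomplete $\beta$ integral must agree with a binomial tail sum; the integral is evaluated by a simple midpoint rule, which is an assumed numerical method and not one used in the text.

```python
from math import comb, factorial


def order_stat_cdf_binomial(y, r, n):
    """Pr{U_(r) <= y}: at least r of the n uniform observations lie in [0, y]."""
    return sum(comb(n, k) * y**k * (1 - y) ** (n - k) for k in range(r, n + 1))


def order_stat_cdf_beta(y, r, n, steps=20000):
    """The same probability from the last line of (3.5), by the midpoint rule."""
    # Gamma(n + 1) / (Gamma(r) * Gamma(n - r + 1)) for integer arguments.
    const = factorial(n) / (factorial(r - 1) * factorial(n - r))
    width = y / steps
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) * width
        total += u ** (r - 1) * (1 - u) ** (n - r)
    return const * total * width


if __name__ == "__main__":
    # The two evaluations should agree to several decimal places.
    print(order_stat_cdf_binomial(0.3, 2, 5), order_stat_cdf_beta(0.3, 2, 5))
```

The binomial form is usually the more convenient one for hand or machine computation, since it needs no numerical integration.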
We consider one more probability distribution. Let $X$ be a real-valued
random variable with a continuous distribution function $F(x)$. We
prove that the induced distribution of $F(X)$ is the uniform distribution
on the interval [0, 1].

For any $x$ define $x-$ and $x+$ by

$$x- = \inf_{F(x') = F(x)} x', \qquad x+ = \sup_{F(x') = F(x)} x';$$

they are respectively the inf and sup of the $x'$ having $F(x') = F(x)$.
Because of the continuity of the function $F(x)$, $F(x-) = F(x) = F(x+)$.
From the definition of $x-$, $x+$, we have the set-inclusion relations

$$\{x' \mid x' \le x-\} \subset \{x' \mid F(x') \le F(x)\} \subset \{x' \mid x' \le x+\}.$$

Consequently we have

$$\Pr\{X \le x-\} \le \Pr\{F(X) \le F(x)\} \le \Pr\{X \le x+\}.$$

Since the outside expressions are each equal to $F(x)$, then

$$\Pr\{F(X) \le F(x)\} = F(x);$$

but, since $F(x)$ takes all values on $]0, 1[$, we have

$$\Pr\{F(X) \le y\} = y \tag{3.7}$$

for $y \in\ ]0, 1[$. Thus $F(X)$ has the uniform distribution on [0, 1].
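
Result (3.7) is easy to illustrate by simulation. In the sketch below, which is an illustration under stated assumptions and not a proof, $X$ is taken to have an exponential distribution (an arbitrary choice of continuous $F$), and the names `exponential_cdf`, `transformed_sample`, and `empirical_cdf` are hypothetical. The proportion of transformed values $F(X)$ not exceeding $y$ should be close to $y$.

```python
import math
import random


def exponential_cdf(x, rate=1.0):
    """F(x) for the exponential distribution; continuous on the real line."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def transformed_sample(size, rate=1.0, seed=0):
    """Draw X from F and return the values F(X)."""
    rng = random.Random(seed)
    return [exponential_cdf(rng.expovariate(rate), rate) for _ in range(size)]


def empirical_cdf(values, y):
    """Proportion of the values that do not exceed y."""
    return sum(v <= y for v in values) / len(values)


if __name__ == "__main__":
    sample = transformed_sample(20000)
    for y in (0.1, 0.5, 0.9):
        print(y, empirical_cdf(sample, y))
```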

We now define a construction procedure whereby, corresponding to $n$
points $x_1, \dots, x_n$ in $\mathcal{X}$, the region $\mathcal{X}$ is partitioned into $n + 1$ disjoint
subsets (numbered from 1 to $n + 1$) and referred to as blocks, and a
further $n$ disjoint subsets referred to as cuts. The theorem at the end of
this section will prove that a distribution-free tolerance region can be
formed by taking any $r$ of the numbered blocks.

The construction procedure is given by two sequences of functions
$\{\phi_j(x)\} = \{\phi_1(x), \dots, \phi_n(x)\}$ and $\{p_j\} = \{p_1, \dots, p_n\}$.
$\phi_j(x) = \phi_j(x; r_1, \dots, r_{j-1})$ is a real-valued measurable function of $x$ which may
depend on $j - 1$ real variables $r_1, \dots, r_{j-1}$. When the function $\phi_j$ is used, the
real values $r_1, \dots, r_{j-1}$ will be values, respectively, of the functions
$\phi_1, \dots, \phi_{j-1}$. $p_j = p_j(r_1, \dots, r_{j-1})$ is an integer in $1, \dots, n$, and which
integer it is may depend on $j - 1$ real variables $r_1, \dots, r_{j-1}$, which once again
will be values attained, respectively, by the functions $\phi_1, \dots, \phi_{j-1}$. Further,
for a given sequence of numbers $r_1, \dots, r_n$, the set of integers $(p_1, \dots, p_n)$ is a
permutation of the set $\{1, \dots, n\}$; thus there can be no two integers the same
in the sequence.

For the construction procedure it is convenient to think of the outcome
$(x_1, \dots, x_n)$ as $n$ points $x_1, \dots, x_n$ in the space $\mathcal{X}$. Now, corresponding to
$\{\phi_j(x)\}$ and $\{p_j\}$ and $n$ points $x_1, \dots, x_n$ in $\mathcal{X}$, we construct $n + 1$ disjoint
blocks by making $n$ cuts or divisions of $\mathcal{X}$. The first is made by using the
function $\phi_1(x)$ and the integer $p_1$. The $n$ values $\phi_1(x_1), \dots, \phi_1(x_n)$ are
examined and the $p_1$ largest value chosen, $\max(p_1)_{i=1}^{n}\, \phi_1(x_i)$. $\mathcal{X}$ is then
divided into two regions,

$$S_{1 \cdots p_1} = \{x \mid \phi_1(x) > \max(p_1)\, \phi_1(x_i)\}$$

and

$$S_{p_1+1 \cdots n+1} = \{x \mid \phi_1(x) < \max(p_1)\, \phi_1(x_i)\}$$

by means of the cut

$$T_{p_1} = \{x \mid \phi_1(x) = \max(p_1)\, \phi_1(x_i)\}.$$

There will, of course, be at least one $x_i$ having $\phi_1(x_i) = \max(p_1)\, \phi_1(x_i)$,
and, when we apply our construction procedure, there will with probability
one be only one such $x_i$. We therefore assume that there is exactly one $x_i$
in $T_{p_1}$. It follows then that there are exactly $p_1 - 1$ $x$'s in $S_{1 \cdots p_1}$ and
exactly $n - p_1$ in $S_{p_1+1 \cdots n+1}$.

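The second cut is made by using the function $\phi_2(x)$ and the integer $p_2$.
These, however, depend on $r_1$, which stands for the value of the function
$\phi_1(x)$ at the cut $T_{p_1}$: $r_1 = \max(p_1)\, \phi_1(x_i)$. Hence

$$\phi_2(x) = \phi_2\bigl(x;\ \max(p_1)\, \phi_1(x_i)\bigr), \qquad p_2 = p_2\bigl(\max(p_1)\, \phi_1(x_i)\bigr).$$

If $p_2$ is one of the integers $1, \dots, p_1 - 1$, then we divide the region $S_{1 \cdots p_1}$
into $S_{1 \cdots p_2}$ and $S_{p_2+1 \cdots p_1}$ and the cut $T_{p_2}$. This is done by calculating
the $p_2$ largest value of $\phi_2(x_i)$ for $x_i$ in $S_{1 \cdots p_1}$ and dividing $S_{1 \cdots p_1}$ into

$$S_{1 \cdots p_2} = S_{1 \cdots p_1} \cap \{x \mid \phi_2(x) > \max(p_2)_{x_i \in S_{1 \cdots p_1}}\, \phi_2(x_i)\}$$

and

$$S_{p_2+1 \cdots p_1} = S_{1 \cdots p_1} \cap \{x \mid \phi_2(x) < \max(p_2)_{x_i \in S_{1 \cdots p_1}}\, \phi_2(x_i)\}$$

by means of the cut

$$T_{p_2} = S_{1 \cdots p_1} \cap \{x \mid \phi_2(x) = \max(p_2)_{x_i \in S_{1 \cdots p_1}}\, \phi_2(x_i)\}.$$

If $p_2$ is one of the integers $p_1 + 1, \dots, n$, then we divide the region
$S_{p_1+1 \cdots n+1}$ into $S_{p_1+1 \cdots p_2}$ and $S_{p_2+1 \cdots n+1}$ and the cut $T_{p_2}$. This is done by
calculating the $p_2 - p_1$ largest† $\phi_2(x_i)$ for $x_i$ in $S_{p_1+1 \cdots n+1}$ and dividing
$S_{p_1+1 \cdots n+1}$ into

$$S_{p_1+1 \cdots p_2} = S_{p_1+1 \cdots n+1} \cap \{x \mid \phi_2(x) > \max(p_2 - p_1)_{x_i \in S_{p_1+1 \cdots n+1}}\, \phi_2(x_i)\}$$

and

$$S_{p_2+1 \cdots n+1} = S_{p_1+1 \cdots n+1} \cap \{x \mid \phi_2(x) < \max(p_2 - p_1)_{x_i \in S_{p_1+1 \cdots n+1}}\, \phi_2(x_i)\}$$

by means of the cut

$$T_{p_2} = S_{p_1+1 \cdots n+1} \cap \{x \mid \phi_2(x) = \max(p_2 - p_1)_{x_i \in S_{p_1+1 \cdots n+1}}\, \phi_2(x_i)\}.$$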

This procedure is continued. For the $j$th stage $\phi_j(x)$ and $p_j$ are used
with $r_1, \dots, r_{j-1}$ replaced by the values of $\phi_1, \dots, \phi_{j-1}$ at their respective
cuts. The set divided is the one having $p_j$ in its index set, and the $x_i$ in that
set are ordered by $\phi_j(x)$ and the $p_j - p_j'$ largest value chosen to represent
the cut. $p_j'$ stands for the largest of the $p_1, \dots, p_{j-1}$ which are less than
$p_j$; it is one less than the smallest index in the set being divided. The set
is then divided into a set having $\phi_j(x)$ values greater than the value
representing the cut, a set having $\phi_j(x)$ values less than the value representing
the cut, and the cut $T_{p_j}$.

By this procedure we obtain finally $n + 1$ regions $S_1, \dots, S_{n+1}$ and $n$
cuts $T_1, \dots, T_n$. Then, provided the conditions of the theorem are
fulfilled, we can choose any $r$ of these $n + 1$ regions to form a tolerance
region, and its coverage will have the simple distribution of $U_{(r)}$, the $r$th
order statistic in a sample of $n$ from the uniform distribution [0, 1]. An
example illustrating this procedure is given after the theorem.

THEOREM 3.2. If $\phi_1(x), \dots, \phi_n(x)$ each have a continuous induced
distribution corresponding to the measure $P$ over $\mathcal{X}$, then the coverages
$P(S_1), \dots, P(S_{n+1})$ of the $n + 1$ regions $S_1, \dots, S_{n+1}$ defined by the
procedure above in terms of a sample $X_1, \dots, X_n$ from $P$ have the uniform
distribution of coverages (3.4) for a sample from the uniform distribution
[0, 1]. Any $r$ of the regions will have a coverage with the $\beta$ distribution
having parameters $p = r$, $q = n - r + 1$.


† If we treat the $x_i$'s in $S_{1 \cdots p_1}$ and $T_{p_1}$ as being "largest" and then order the $x_i$'s in
$S_{p_1+1 \cdots n+1}$ by means of the function $\phi_2(x_i)$, then the $p_2$ "largest" $x_i$ is used to form the cut.
The requirement of continuity of the induced distribution of the $\phi$'s may
be modified to hold almost everywhere ($P$) with respect to the auxiliary
arguments of the $\phi$ functions.


Proof. The proof is done stepwise, corresponding to the stages in the 
constructional procedure, and a correspondence is set up with the simple 
case of sampling from the uniform distribution. For this latter case the 
distributions were obtained earlier in this section, and they are as given in 
the statement of the theorem. 

Consider the distribution of $\phi_1(X)$ corresponding to $P$ over $\mathcal{X}$. It has
a continuous distribution function, say $F(y)$. To simplify the proof we
assume that $F(y)$ is a strictly increasing function; it is a matter of detail to
remove this restriction. By the last probability result at the beginning of
this section, we know that $F(\phi_1(X))$ has as induced distribution the uniform
distribution [0, 1]. Then, for a sample of $n$ from $P$, we have that
$F(\phi_1(X_1)), \dots, F(\phi_1(X_n))$ are independent, each with the uniform
distribution [0, 1]. Therefore $\max(p_1)\{F(\phi_1(X_1)), \dots, F(\phi_1(X_n))\}$ has the
distribution of the $p_1$th largest order statistic $U_{(n-p_1+1)}$ from the uniform
distribution. But, because of the strict monotonicity of $F(y)$, we have

$$\max(p_1)\, F(\phi_1(x_i)) = F\bigl(\max(p_1)\, \phi_1(x_i)\bigr).$$

Then, from the definition of $F(y)$, we obtain

$$P(S_{1 \cdots p_1}) = 1 - F\bigl(\max(p_1)\, \phi_1(x_i)\bigr),$$
$$P(S_{p_1+1 \cdots n+1}) = F\bigl(\max(p_1)\, \phi_1(x_i)\bigr).$$

Because of these relations, the distribution of the coverages of $S_{1 \cdots p_1}$ and
$S_{p_1+1 \cdots n+1}$ is the same as the distribution of the coverages of $[u_{(n-p_1+1)}, 1]$
and $[0, u_{(n-p_1+1)}]$ in sampling from the uniform distribution. Now from
Theorem 3.1 the conditional distributions given these coverages, that is,
given $F(\max(p_1)\, \phi_1(x_i)) = t$ and $u_{(n-p_1+1)} = t$, is for the first case that of
samples of $p_1 - 1$ and $n - p_1$ from $P$ restricted, respectively, to $S_{1 \cdots p_1}$ and
to $S_{p_1+1 \cdots n+1}$, and for the uniform case is that of samples of $p_1 - 1$ and
$n - p_1$ from the uniform distribution restricted to $[u_{(n-p_1+1)}, 1]$ and
$[0, u_{(n-p_1+1)}]$. Also, for both the original case and the uniform case the
coverages calculated using the restricted distribution to the coverages
calculated using the original distribution will be in the ratio $1 - t$ to 1 for
the sample of $p_1 - 1$, and $t$ to 1 for the sample of $n - p_1$. The coverages
of the sets $S_{1 \cdots p_1}$, $S_{p_1+1 \cdots n+1}$ have been related to corresponding coverages
for the uniform distribution, and for each set the $x$'s act as a sample from
the restricted original distribution and correspond to $u$'s from the restricted
uniform distribution.
The above argument not only derived the necessary results for the first
stage, but gave the additional argument necessary to show that for the
second stage the proof will be a duplicate of the first-stage proof, only
applied to the reduced space $S_{1 \cdots p_1}$ and $[u_{(n-p_1+1)}, 1]$ or $S_{p_1+1 \cdots n+1}$ and
$[0, u_{(n-p_1+1)}]$ as the case may be.

After the final stage we obtain $P(S_1), \dots, P(S_{n+1})$ with the same
distribution as $c_{n+1}, \dots, c_1$ for the uniform distribution. This completes
the proof.
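
The theorem can be illustrated by simulation in a simple case. The sketch below is an illustration under stated assumptions, not a construction given in the text: $P$ is taken to be the uniform distribution on the unit square (so that coverage equals area), four cuts are made with the functions $y$, $-y$, $x$, $-x$, each removing one block, and the function names are hypothetical. The rectangle left over corresponds to $n + 1 - 4$ blocks, so by the theorem and (3.6) its average coverage should be near $(n - 3)/(n + 1)$.

```python
import random


def trimmed_rectangle_coverage(n, rng):
    """Coverage of the region left after four one-block cuts of a sample of n.

    Assumes P is uniform on the unit square, so coverage equals area.
    """
    points = [(rng.random(), rng.random()) for _ in range(n)]
    # Cut with the function y: remove the block above the largest y.
    top = max(points, key=lambda pt: pt[1])
    points.remove(top)
    # Cut with the function -y: remove the block below the smallest remaining y.
    bottom = min(points, key=lambda pt: pt[1])
    points.remove(bottom)
    # Cuts with x and -x use only the points still inside the strip.
    right = max(points, key=lambda pt: pt[0])
    points.remove(right)
    left = min(points, key=lambda pt: pt[0])
    return (top[1] - bottom[1]) * (right[0] - left[0])


def mean_coverage(n, trials, seed=0):
    """Average coverage of the trimmed rectangle over repeated samples."""
    rng = random.Random(seed)
    return sum(trimmed_rectangle_coverage(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 19  # illustrative sample size: 20 blocks, of which 16 remain
    print(mean_coverage(n, trials=20000), (n - 3) / (n + 1))
```

Note that the later cuts depend on which points survive the earlier ones, as the construction procedure permits.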


Note. The theorem can be generalized with little change in proof to
permit the construction functions at any stage to depend on the value of
the $x$'s at previous cuts and not just on the values of the $\phi$ functions at
those cuts (J. H. B. Kemperman).

A treatment of the discontinuous case is given in [6], [8], [9] and [10]. 

Distribution-free tolerance regions can be used to form $\beta$ tolerance
regions for a proportion $p$; for convenience we give again Definition 5.1
of Chapter 3:

$S(x_1, \dots, x_n)$ is a $\beta$ tolerance region for a proportion $p$ if

$$\inf_{\theta} \Pr\nolimits_\theta \{P_\theta(S(X_1, \dots, X_n)) \ge p\} = \beta.$$

If $S(x_1, \dots, x_n)$ is a distribution-free tolerance region, then we have that
$P_\theta(S(X_1, \dots, X_n))$ has a distribution on [0, 1] which is independent of $\theta$.
Therefore, $S(x_1, \dots, x_n)$ can be treated as a tolerance region for a proportion
$p$ if $\beta$ is given by

$$\beta = \Pr\nolimits_\theta \{P_\theta(S(X_1, \dots, X_n)) \ge p\}.$$

Now for distribution-free tolerance regions obtained from Theorem 3.2,
R. B. Murphy [11] has used the $\beta$ distribution to construct graphs
connecting $\beta$, $p$, $n$, $r$.
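
In place of graphs, the connection between $\beta$, $p$, $n$, $r$ can be computed directly. The following is a minimal sketch and not Murphy's method; the function name `tolerance_confidence` is hypothetical. It uses the fact that the coverage of $r$ blocks is distributed as $U_{(r)}$, together with the standard observation that $U_{(r)} \ge p$ exactly when fewer than $r$ of $n$ uniform observations fall below $p$, so that $\beta$ is a binomial sum.

```python
from math import comb


def tolerance_confidence(n, r, p):
    """beta = Pr{coverage of r blocks >= p} for a sample of n.

    The coverage is distributed as the r-th order statistic of n uniforms,
    which is at least p exactly when fewer than r observations fall below p.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r))


if __name__ == "__main__":
    # Illustrative values: 36 blocks from a sample of 59, proportion 0.5.
    print(tolerance_confidence(59, 36, 0.5))
```

For a fixed sample size and proportion, the confidence computed this way increases with the number of blocks used.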

EXAMPLE 3.1. A sample of 59 observations is made from a continuous
bivariate distribution known to have two modes; a 50% tolerance region
in two parts centering on the two modes is desired. From Murphy's
graphs [11] a region formed from 36 blocks† is seen to have a 90%
probability of containing at least 50% of the population; that is, 90%
confidence that the region contains at least 50% of the population. The
following procedure is proposed as a solution to obtaining the required
region.

The 59 points are plotted in Fig. 16. The function $y$ is used to remove
two blocks by the cut $c_2$; two further blocks using the function $-y$ are
removed by the cut $c_4$. Similarly $x$ and $-x$ are used to form cuts $c_6$ and
$c_8$. The rectangle so formed now corresponds to 52 blocks.


† It is worth noticing that only 60% of the equivalent blocks yields 90% confidence
in at least 50% of the population.
The rectangle is tentatively cut into eight sections formed by the two
diagonals and the two lines through the center parallel to the $x$ and $y$ axis.
For convenience, number these sections from 1 to 8 clockwise starting at
top center. In the first section, cut off one block from the outside, using a
line making an angle of $-22\tfrac{1}{2}^\circ$ to the $x$ axis; that is, use the function
$y + x \tan 22\tfrac{1}{2}^\circ$ to form the cut $c_9$. For the second section, use the function
$y + x \tan 67\tfrac{1}{2}^\circ$ to remove one block by the cut $c_{10}$. Apply a similar
procedure to each of the 6 remaining sections, thus forming cuts $c_{11}$, $c_{12}$,
$c_{13}$, $c_{14}$, $c_{15}$, and $c_{16}$. The region now remaining corresponds to 44 blocks.

Figure 16.

The 8 sections originally were of equal area. Each section has had a
block removed, thus reducing the areas to the values $a_1, \dots, a_8$, say.
Further cutting will depend on these areas, they being an indication of the
relative positions of the two modes. Consider the total area of an
adjacent pair of reduced sections and of the opposite pair; for example,
total area equals $a_1 + a_2 + a_5 + a_6$. Do this for each of the four possible
selections. From the diagram it is easily seen that the group with
minimum total area corresponds to Sections 3, 4, 7, 8. These are the
sections that presumably tend to separate the two modes; hence divide
the remaining region by a line with slope $-1$. If the blocks had been
2, 3, 6, 7, we would have used a line with slope 0. The reasoning behind
this procedure is quite straightforward.
Using the function $y + x$, we divide the 44-block region into parts
corresponding to 22 blocks and 22 blocks; that is, we choose the point
giving the 22d largest value to the function $y + x$ and make the cut $c_{38}$.
The two regions formed by this cut are further reduced with the objective
being to form two circular regions each corresponding to 18 blocks.

Use the function $(y - \eta)^2 + (x - \xi)^2$ to remove four blocks from the
right-hand region. As center of the circle $(\xi, \eta)$ a reasonable choice
would be the center, marked $\times$, of the largest circle that can be inscribed in
that region. Cuts $c_{39}$, $c_{40}$, $c_{41}$, and $c_{42}$ are made by this function. We
apply a similar procedure to the left-hand region.

The resultant two circles form a region T composed of 36 blocks and 
hence in repeated sampling have 90% confidence of being at least a 50% 
tolerance region. It should be noted that the two parts of T will not 
always be circular; they will be circular with perhaps indentations. See, 
for example, cut C4.) 


4. PROBLEMS FOR SOLUTION 


1. By constructing a class of probability distributions contained in the absolutely 
continuous distributions on the real line, show that there does not exist an unbiased 
estimate of the median for any sample size. 

2. By constructing a simple type of probability distribution, show that y2 is not of 
degree one with respect to the class of absolutely continuous distributions. 

3. For the class of absolutely continuous distributions on the real line Rt, what are 
the degrees of the parameters 


E(X*), E(X), EXLX — E(X)}}, EXX)? 
Find the minimum variance unbiased estimators. 


4. For the class of absolutely continuous distributions over R?, what are the degrees 
of the parameters 


EX), 0%, E(X Xa), akp E*(X,)E(X2)? 
of the vector parameters 


E(X}) ok, 
E(X, X) Cov {X, Xa} 
E(X) oR, 3 
Find the minimum variance unbiased estimators. 
5. If (ua), ***, Um) is the order statistic for a sample of n from the uniform distri- 


bution [0, 1], find the marginal distribution of (u), Utro). 
6. Find the joint distribution of $\bigl(P(\bigcup_{i \in R} S_i),\ P(\bigcup_{i \in S} S_i)\bigr)$ where $R$, $S$ are disjoint
sets of $r$, $s$ integers chosen from $1, \dots, n + 1$, and where $S_1, \dots, S_{n+1}$ are
distribution-free tolerance regions as given by Theorem 3.2.

7. If $f_1(x)$, $f_2(x)$ are statistics over $\mathcal{X}(\mathcal{A})$ and $P(A)$ is a measure on $\mathcal{X}(\mathcal{A})$, show that
the conditional measure $P(A \mid (f_1, f_2))$ is equal to $P_{f_1}(A \mid f_2)$, the conditional measure
given $f_2$, corresponding to $P(A \mid f_1)$. Assume the conditional probabilities are measures.
CHAPTER 5


The Theory of Hypothesis Testing 


1. UNBIASEDNESS 


Unbiasedness was defined in Section 3.6 of Chapter 2. Roughly, a test 
is unbiased if the probability of accepting the alternative is larger when the 
alternative is true than when the hypothesis is true. This is a reasonable 
property to require of a test. For nonparametric theory a general method
of constructing unbiased tests would be of considerable value. However, 
we have only a technique which is capable, for some of the simpler prob- 
lems, of producing an unbiased test. Also, for two of the simpler 
problems we have a criterion which can be used to check for unbiasedness. 
First we consider this criterion as it applies to the two-sample problem. 

The two-sample problem was described by (3.1) in Chapter 3. For this
let $X_1, \dots, X_{n_1}$ be independent with distribution function $F(x)$ and
$X_{n_1+1}, \dots, X_{n_1+n_2}$ be independent with distribution function $G(x)$ where $F$ and $G$
are assumed continuous. The hypothesis is that $F(x) = G(x)$ for all $x$.
We consider a one-sided alternative for which the second sample values
tend to be larger than first sample values. Precisely, for the alternative
we assume that

$$F(x) \ge G(x) \tag{1.1}$$

for all $x$. This means that

$$\Pr\nolimits_F\{X_1 \le x\} \ge \Pr\nolimits_G\{X_{n_1+1} \le x\}$$

or

$$\Pr\nolimits_F\{X_1 > x\} \le \Pr\nolimits_G\{X_{n_1+1} > x\}$$

for all $x$; in such a case we say that the random variable $X_{n_1+1}$ is stochastically
larger than the random variable $X_1$. The following criterion was given by
Lehmann [1].
THEOREM 1.1 (LEHMANN). If the test function satisfies
$\phi(x_1, \dots, x_{n_1}, x'_{n_1+1}, \dots, x'_{n_1+n_2}) \ge \phi(x_1, \dots, x_{n_1}, x_{n_1+1}, \dots, x_{n_1+n_2})$ whenever $x'_i \ge x_i$
$(i = n_1 + 1, \dots, n_1 + n_2)$, then the power function satisfies $P_\phi(F, G) \ge
P_\phi(F, F)$ for $F$, $G$ satisfying (1.1). If, in addition, $\phi$ is a similar test, then $\phi$
is unbiased against the alternative given by (1.1).


Note. The criterion says that, if one outcome $(x_1, \dots, x_{n_1}, x'_{n_1+1}, \dots,
x'_{n_1+n_2})$ is more "extreme" than another $(x_1, \dots, x_{n_1+n_2})$, then the test
function for the first is at least as large as for the second, that is, more
likely to reject the hypothesis.


Proof. First we define a monotone function $f(x)$ so that, if $X$ is a
random variable with distribution function $F(x)$, then $f(X)$ is a random
variable with distribution function $G(x)$. Let $G^{-1}(u)$ be the inverse
function to the continuous function $G(x)$. If $G(x)$ is strictly increasing,
then $G^{-1}$ is well defined. Otherwise, there will be values for $u$ such that
$G^{-1}$ is multivalued. For each of these $u$ we make $G^{-1}$ single-valued by
choosing one of the possible values. Then we define a function $f(x)$ by
the equation

$$f(x) = G^{-1}(F(x)). \tag{1.2}$$

This $f(x)$ satisfies the required property; for simplicity we give the proof
when $G(x)$ is strictly increasing. We must prove that $f(X) = G^{-1}(F(X))$
has the distribution function $G(x)$. By (3.7) in Chapter 4, $F(X)$ has the
uniform distribution [0, 1]. Designating by $U$ a random variable with
this uniform distribution, we consider $G^{-1}(U)$:

$$\Pr\{G^{-1}(U) \le x\} = \Pr\{U \le G(x)\} = G(x).$$

Thus $G^{-1}(U)$ has the distribution function $G(x)$, and the function $f(x)$ has
the desired property.

From (1.2) we obtain

$$G(f(x)) = F(x), \tag{1.3}$$

and this with (1.1) implies that $f(x) \ge x$ for all $x$.
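
The function $f(x)$ of (1.2) can be written out for a specific pair of distributions. The sketch below is an illustration only; the exponential distributions $F(x) = 1 - e^{-x}$ and $G(x) = 1 - e^{-x/2}$ for $x > 0$ are assumed choices satisfying $F(x) \ge G(x)$, and the function names are hypothetical. For this pair $f(x)$ reduces to $2x$, which is never less than $x$ on the positive axis, and the empirical distribution of $f(X)$ should be close to $G$.

```python
import math
import random


def F(x):
    """First-sample distribution function (assumed exponential, mean 1)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def G(x):
    """Second-sample distribution function (assumed exponential, mean 2)."""
    return 1.0 - math.exp(-x / 2.0) if x > 0 else 0.0


def G_inverse(u):
    """Inverse of G on [0, 1[; G is strictly increasing on the positive axis."""
    return -2.0 * math.log(1.0 - u)


def f(x):
    """The function (1.2): f(x) = G^{-1}(F(x))."""
    return G_inverse(F(x))


def proportion_below(x, size=20000, seed=0):
    """Proportion of simulated values f(X) not exceeding x, with X drawn from F."""
    rng = random.Random(seed)
    return sum(f(rng.expovariate(1.0)) <= x for _ in range(size)) / size


if __name__ == "__main__":
    for x in (0.5, 1.0, 3.0):
        print(x, f(x), proportion_below(x), G(x))
```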

We now use the function $f(x)$ to establish the power-function
inequality. Let $E_{F,G}$ designate an expectation taken with random variables
$X_1, \dots, X_{n_1}$ having distribution function $F(x)$ and random variables
$X_{n_1+1}, \dots, X_{n_1+n_2}$ having distribution function $G(x)$. We have

$$\begin{aligned}
P_\phi(F, G) &= E_{F,G}\{\phi(X_1, \dots, X_{n_1}, X_{n_1+1}, \dots, X_{n_1+n_2})\} \\
&= E_{F,F}\{\phi(X_1, \dots, X_{n_1}, f(X_{n_1+1}), \dots, f(X_{n_1+n_2}))\} \\
&\ge E_{F,F}\{\phi(X_1, \dots, X_{n_1}, X_{n_1+1}, \dots, X_{n_1+n_2})\} \\
&= P_\phi(F, F).
\end{aligned}$$

The inequality in the third line follows from the $\phi$-function inequality in
the statement of the theorem and from $f(x) \ge x$.

If the test function is similar of size $\alpha$, then

$$P_\phi(F, F) = \alpha$$

for all $F$, and then it follows that

$$P_\phi(F, G) \ge \alpha.$$

Therefore the test is unbiased for the alternative given by (1.1). This
completes the proof.

The above theorem can also be used to extend the hypothesis for the
two-sample problem. Let $\phi(x_1, \dots, x_{n_1+n_2})$ be a test function which
satisfies the criterion and which is similar of size $\alpha$ when $F(x) = G(x)$. The
theorem states that $\phi$ is unbiased for

(1.4)   Hypothesis: $F(x) = G(x)$ for all $x$, $F$ continuous,
        Alternative: $F(x) \ge G(x)$ for all $x$, $F$, $G$ continuous.

Consider two distribution functions satisfying $F(x) \le G(x)$ for all $x$. We
can define an $f(x)$ as in the first part of the proof; it will satisfy $f(x) \le x$
for all $x$. Then, following the pattern of the remainder of the theorem,
we obtain

$$P_\phi(F, F) \ge P_\phi(F, G).$$

But $P_\phi(F, F) = \alpha$. Hence $\phi$ is a size-$\alpha$ test for the problem:

(1.5)   Hypothesis: $F(x) \le G(x)$ for all $x$, $F$, $G$ continuous,
        Alternative: $F(x) \ge G(x)$ for all $x$, $F$, $G$ continuous,

and by the theorem it is unbiased.


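EXAMPLE 1.1. We illustrate the theorem by the well known Mann-Whitney
(Wilcoxon) test for the two-sample problem. The test is based on
the statistic

$$V = \frac{1}{n_1 n_2}\,\bigl[\text{number of pairs } (x_i, x_{n_1+j}) \text{ with } x_i < x_{n_1+j};\ i = 1, \dots, n_1;\ j = 1, \dots, n_2\bigr]$$

and has

$$\phi(x) = \begin{cases} 1 & \text{if } V \ge V_\alpha, \\ 0 & \text{if } V < V_\alpha. \end{cases} \tag{1.6}$$

The number $V_\alpha$ is chosen to give the test size $\alpha$ and depends only on $n_1$
and $n_2$.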
We see if the criterion is satisfied. If we increase any of the $x_{n_1+j}$ we
can only increase the value of $V$; that is, only increase the value of $\phi$.
Now the value of the statistic $V$ depends only on the permutation of the
$n_1 + n_2$ numbers $x_{(1)}, \dots, x_{(n_1+n_2)}$ which gives the numbers $x_1, \dots, x_{n_1+n_2}$
and, when $F = G$, each permutation has the same probability $1/(n_1 + n_2)!$.
Hence the distribution of $V$ when $F = G$ does not depend on $F$, and the
test is similar. Therefore, by Theorem 1.1 the Mann-Whitney test is
unbiased for the problem (1.4) above.

By the argument following the theorem, it is easily seen that the
Mann-Whitney test is an unbiased test of size $\alpha$ for the extended problem as
given by (1.5).
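
Both properties used in this example, monotonicity in the second sample and a null distribution depending only on the arrangement of ranks, can be seen in a small computation. The following is a minimal sketch with hypothetical function names; it enumerates every choice of ranks for the first sample, which is practical only for small $n_1$, $n_2$.

```python
from itertools import combinations


def mann_whitney_v(first, second):
    """Proportion of pairs (x_i, x_{n1+j}) with x_i < x_{n1+j}."""
    pairs = sum(1 for x in first for y in second if x < y)
    return pairs / (len(first) * len(second))


def null_distribution(n1, n2):
    """Exact distribution of V when F = G.

    Every assignment of n1 of the n1 + n2 ranks to the first sample is
    equally likely, so V is tabulated over all such assignments.
    """
    ranks = range(n1 + n2)
    counts = {}
    for first_ranks in combinations(ranks, n1):
        chosen = set(first_ranks)
        second_ranks = [k for k in ranks if k not in chosen]
        v = mann_whitney_v(first_ranks, second_ranks)
        counts[v] = counts.get(v, 0) + 1
    total = sum(counts.values())
    return {v: c / total for v, c in sorted(counts.items())}


if __name__ == "__main__":
    # Raising a second-sample value cannot lower V.
    print(mann_whitney_v([1, 5], [2, 3]), mann_whitney_v([1, 5], [2, 6]))
    print(null_distribution(2, 2))
```

The critical number $V_\alpha$ would then be read from the upper tail of the tabulated distribution.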


For the two-sample scale problem (3.5) in Chapter 3 we have an analog
of Theorem 1.1 above. Consider those tests based on the differences
among the elements of the first sample and the differences among the
elements of the second sample; that is, consider tests which can be written
in the form

$$\phi(x_2 - x_1, \dots, x_{n_1} - x_1;\ x_{n_1+2} - x_{n_1+1}, \dots, x_{n_1+n_2} - x_{n_1+1}). \tag{1.7}$$

Now, if an increase in the differences for the second sample results in at
most an increase in $\phi$, and if the test is similar for $F = G$, then the test is
unbiased for the alternative of (3.5). There the second sample distribution
was more spread out than the first sample distribution.


We now consider a procedure which is sometimes able to produce an
unbiased test. For this, it is essential that we have repetitions of a
component experiment; that is, a sample from a distribution over a
component space. First, we try to find a real parameter which effectively
distinguishes between the hypothesis and the alternative. Let $\theta$ be the
parameter which indexes the probability measures in the problem, and let
$\Delta(\theta)$ be a real parameter. We say that $\Delta(\theta)$ distinguishes between the
hypothesis and the alternative if, on the basis of the value of $\Delta(\theta)$, we can
say whether $\theta \in \omega$ (the hypothesis) or $\theta \in \Omega - \omega$ (the alternative). This
would be the case if, for example, $\Delta(\theta) = \Delta_0$ when $\theta \in \omega$ and $\Delta(\theta) > \Delta_0$
when $\theta \in \Omega - \omega$. Second, we try to find an event for one, two, $\dots$
repetitions of the component experiment which has as its probability of
occurrence the real parameter $\Delta(\theta)$. This immediately restricts us to
parameters $\Delta(\theta)$ which take their values in the interval [0, 1]. Then,
depending on the number of such groups of repetitions that we can form
in the over-all experiment, we can have that number of independent
repetitions, on each of which we observe whether or not the event occurs.
The problem can then be treated as one in terms of the binomial
distribution. If $\Delta(\theta) = \Delta_0$ for the hypothesis and $\Delta(\theta) > \Delta_0$ for the alternative,
there is certainly an unbiased test: it is the usual test based on the number
of occurrences of the event. Also, if $\Delta(\theta) \ne \Delta_0$ corresponds to the
alternative, a two-sided unbiased binomial test can be constructed. This
procedure was developed by Lehmann in [1].


EXAMPLE 1.2. Consider again the two-sample problem as given in
Section 3.1 of Chapter 3. Let $X_1, \dots, X_{n_1}$ have a continuous distribution
function $F(x)$ and $X_{n_1+1}, \dots, X_{n_1+n_2}$ have a continuous distribution
function $G(x)$. We examine the general problem

(1.8)   Hypothesis: $F(x) = G(x)$,
        Alternative: $F(x) \ne G(x)$.

As a real parameter which distinguishes between the hypothesis and
alternative, consider

$$\Delta(F, G) = \int_{-\infty}^{+\infty} [F(x) - G(x)]^2\; d\,\frac{F(x) + G(x)}{2}, \tag{1.9}$$

the average squared difference with respect to the mean distribution
function. Obviously, for the hypothesis, $\Delta(F, F) = 0$. By the following
lemma we show that, for the alternative, $\Delta(F, G) > 0$.

LEMMA 1.1. $F(x) = G(x)$ if and only if

$$\Delta(F, G) = \int_{-\infty}^{+\infty} (F - G)^2\; d\,\frac{F + G}{2} = 0. \tag{1.10}$$

Proof. It is obvious that $F(x) = G(x)$ implies $\Delta(F, G) = 0$. We
therefore need only prove that $F(x) \ne G(x)$ implies $\Delta(F, G) > 0$. Let $x_1$ be
such that $F(x_1) \ne G(x_1)$, and say $F(x_1) - G(x_1) = d > 0$. But $F(-\infty) =
G(-\infty) = 0$ and $F$, $G$ are continuous; therefore there exists $x_0 < x_1$ such
that $F(x_0) - G(x_0) = d/2$ and $F(x) - G(x) > d/2$ for $x_0 < x \le x_1$. Since
neither $F(x)$ nor $G(x)$ can decrease, one of $F(x)$, $G(x)$ must increase by at
least $d/2$ when $x$ goes from $x_0$ to $x_1$. We have

$$\Delta(F, G) \ge \int_{x_0}^{x_1} (F - G)^2\; d\,\frac{F + G}{2} \ge \Bigl(\frac{d}{2}\Bigr)^2 \frac{d}{4} > 0.$$

This completes the proof.
From the lemma we have $\Delta(F, G) = 0$, $> 0$, according as $(F, G)$ does
or does not belong to the hypothesis. $\Delta$ is thus a real parameter which
distinguishes between the hypothesis and alternative. We look for an
event having $\Delta$ or a linear function of $\Delta$ as its probability.

Let $X_1$, $X_2$ be independent random variables with the distribution
function $F(x)$, and let $X'_1$, $X'_2$ be independent random variables with the
distribution function $G(x)$. For convenience we designate the relationship
$\max(x_1, x_2) < \min(x'_1, x'_2)$ by $x_1, x_2 < x'_1, x'_2$. Consider the event
$X_1, X_2 < X'_1, X'_2$ or $X'_1, X'_2 < X_1, X_2$. Designating its probability by $p$,
we shall prove that

$$p = \Pr\{X_1, X_2 < X'_1, X'_2 \text{ or } X'_1, X'_2 < X_1, X_2\} = 1/3 + 2\Delta(F, G).$$

Noting for example that $\max(X_1, X_2)$ has distribution function $F^2(x)$ and
remembering that $F$ and $G$ are continuous, we can write

$$\begin{aligned}
p &= \int (1 - G)^2\, dF^2 + \int (1 - F)^2\, dG^2 \\
&= 2 + \Bigl[\int G^2\, dF^2 + \int F^2\, dG^2\Bigr] - 2\int G\, dF^2 - 2\int F\, dG^2 \\
&= 2 + \int d(F^2 G^2) - 4\int GF\, dF - 4\int FG\, dG \\
&= 3 - 4\int FG\; d(F + G) \\
&= 3 - 2\int \bigl[(F + G)^2 - (F - G)^2\bigr]\; d\,\frac{F + G}{2} \\
&= 3 - 8\int \Bigl(\frac{F + G}{2}\Bigr)^2 d\,\frac{F + G}{2} + 2\int (F - G)^2\; d\,\frac{F + G}{2} \\
&= 3 - 8/3 + 2\Delta(F, G) \\
&= 1/3 + 2\Delta(F, G).
\end{aligned}$$
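
The value of $p$ can be checked in two ways. The sketch below is illustrative, with hypothetical function names. Under the hypothesis only the ordering of the four observations matters, so $p$ is found by enumerating the 24 orderings. For an alternative, the assumed pair $F(x) = x$ and $G(x) = x^2$ on [0, 1] is simulated; for this pair direct integration of (1.9) gives $\Delta = 1/30$, so that $p$ should be near $1/3 + 2/30 = 2/5$.

```python
import math
import random
from fractions import Fraction
from itertools import permutations


def event(x1, x2, y1, y2):
    """True when both x's lie below both y's, or both y's lie below both x's."""
    return max(x1, x2) < min(y1, y2) or max(y1, y2) < min(x1, x2)


def null_probability():
    """Exact p when F = G: every ordering of the four values is equally likely."""
    orderings = list(permutations(range(4)))
    hits = sum(event(*ordering) for ordering in orderings)
    return Fraction(hits, len(orderings))


def simulated_probability(trials, seed=0):
    """Estimate p when F(x) = x and G(x) = x^2 on [0, 1] (assumed example).

    If U is uniform, sqrt(U) has distribution function x^2.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x1, x2 = rng.random(), rng.random()
        y1, y2 = math.sqrt(rng.random()), math.sqrt(rng.random())
        hits += event(x1, x2, y1, y2)
    return hits / trials


if __name__ == "__main__":
    print(null_probability())
    print(simulated_probability(100000))
```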


We thus have an event, $x_1, x_2 < x'_1, x'_2$ or $x'_1, x'_2 < x_1, x_2$, based on two $x$'s
from the first sample and two $x$'s from the second sample, and this event
has probability $1/3 + 2\Delta(F, G)$, a parameter that distinguishes between
the hypothesis and the alternative. This probability takes its minimum
value 1/3 for the hypothesis. Also, we are able to observe $m = \tfrac{1}{2} \min(n_1, n_2)$
independent repetitions, on each of which the event can or cannot occur.
Each repetition is formed from two $x$'s from each sample. Our problem
is thus reduced to

        Hypothesis: $p = 1/3$,
        Alternative: $p > 1/3$,

for the binomial distribution with parameters $m$, $p$. By Problem 26 in
Chapter 1, the most powerful test is

$$\phi(y) = \begin{cases} 1 & \text{if } y > c, \\ a & \text{if } y = c, \\ 0 & \text{if } y < c, \end{cases}$$

where $y$ is the number of occurrences of the event and $a$, $c$ are chosen to
give the test exact size $\alpha$ for the binomial distribution with parameters
$m$, 1/3. Since this binomial test is unbiased, we have obtained an unbiased
test for the two-sample problem (1.8).
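
The constants $a$, $c$ can be found by accumulating the upper tail of the binomial distribution. The following is a minimal sketch under that standard approach; the function names and the illustrative values $m = 10$, $\alpha = 0.05$ are assumptions, not taken from the text.

```python
from math import comb


def binomial_pmf(y, m, p):
    """Pr{Y = y} for Y binomial with parameters m, p."""
    return comb(m, y) * p**y * (1 - p) ** (m - y)


def randomized_test_constants(m, alpha, p0=1 / 3):
    """Return (c, a) with Pr{Y > c} + a * Pr{Y = c} = alpha under p0."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    upper_tail = 0.0  # running value of Pr{Y > c}
    for c in range(m, -1, -1):
        at_c = binomial_pmf(c, m, p0)
        if upper_tail + at_c > alpha:
            # Rejecting at c with probability a uses up the remaining size.
            return c, (alpha - upper_tail) / at_c
        upper_tail += at_c
    raise ArithmeticError("no critical value found; check m and alpha")


def power(m, c, a, p):
    """Probability of rejection when the event has probability p."""
    tail = sum(binomial_pmf(y, m, p) for y in range(c + 1, m + 1))
    return tail + a * binomial_pmf(c, m, p)


if __name__ == "__main__":
    c, a = randomized_test_constants(10, 0.05)
    print(c, a, power(10, c, a, 1 / 3), power(10, c, a, 0.5))
```

Evaluating `power` at values of $p$ above 1/3 shows the rejection probability rising with $p$, which is the unbiasedness used in the example.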


The test above has the satisfying property that its power is a strictly
increasing function of $p$ and hence of $\Delta(F, G)$ which is a measure of the
difference between the distribution functions $F$, $G$. However, from another
point of view this test has a somewhat less satisfactory property. We now
discuss this.

We have expressed the problem in terms of the parameter $\Delta(F, G)$.
Also, our test can be based on a statistic $V$, the proportion of times the
pairwise inequality occurs:

$$V = \frac{1}{m}\,\bigl[\text{number of pairs having } x_{2i-1}, x_{2i} < x_{n_1+2i-1}, x_{n_1+2i} \text{ or } x_{n_1+2i-1}, x_{n_1+2i} < x_{2i-1}, x_{2i}\ (i = 1, \dots, m)\bigr].$$

From the properties of the binomial distribution, we have

$$E_{F,G}\{V\} = \frac{1}{3} + 2\Delta(F, G);$$

thus the statistic $V$ is an unbiased estimator of the parameter
$1/3 + 2\Delta(F, G)$. If we introduce a counter function $\psi(x_1, x_2; x'_1, x'_2)$ which
records whether the pairwise inequality has occurred,

$$\psi(x_1, x_2; x'_1, x'_2) = \begin{cases} 1 & \text{if } x_1, x_2 < x'_1, x'_2 \text{ or } x'_1, x'_2 < x_1, x_2, \\ 0 & \text{otherwise,} \end{cases}$$

then the statistic $V$ can be written

$$V = \frac{1}{m} \sum_{i=1}^{m} \psi(x_{2i-1}, x_{2i};\ x_{n_1+2i-1}, x_{n_1+2i}).$$
It would seem that an improved test would be obtained if, instead of using
the statistic $V$, we used the statistic having minimum variance among
unbiased estimators of $1/3 + 2\Delta(F, G)$. By the technique used in
Example 2.4 of Chapter 4, this better estimator will be a statistic $V^*$ which
is obtained from $V$ by symmetrizing in the first sample $x$'s and also in the
second sample $x$'s. We have

$$V^* = \frac{1}{\binom{n_1}{2}\binom{n_2}{2}} \sum_{i < i'} \sum_{j < j'} \psi(x_i, x_{i'};\ x_{n_1+j}, x_{n_1+j'});$$

$V^*$ is the proportion of cases where a first sample pair is less than or
greater than a second sample pair. The altered test would be to reject
the hypothesis for large values of $V^*$:

$$\phi(V^*) = \begin{cases} 1 & \text{if } V^* > c^*, \\ a^* & \text{if } V^* = c^*, \\ 0 & \text{if } V^* < c^*, \end{cases}$$

where $a^*$, $c^*$ are chosen to give the test size $\alpha$ under the equal probability
permutations of the hypothesis. Lehmann [1] has stated that this test is
no longer unbiased. Nevertheless, the properties that $V^*$ is of minimum
variance and has a limiting normal distribution indicate that this is a good
test for the two-sample problem (1.8).
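
The difference between $V$ and $V^*$ is easy to see on a small data set. The sketch below is illustrative only, with hypothetical function names; `psi` stands for the counter function recording the pairwise inequality. $V$ uses each observation once, in the order given, while $V^*$ averages over all pairs from each sample and so does not depend on the order of observations within a sample.

```python
from itertools import combinations
from math import comb


def psi(x1, x2, y1, y2):
    """Counter function: 1 if the pairwise inequality occurs, otherwise 0."""
    below = max(x1, x2) < min(y1, y2)
    above = max(y1, y2) < min(x1, x2)
    return 1 if below or above else 0


def v_statistic(first, second):
    """V: proportion of occurrences over m disjoint groups of two and two."""
    m = min(len(first), len(second)) // 2
    hits = sum(
        psi(first[2 * i], first[2 * i + 1], second[2 * i], second[2 * i + 1])
        for i in range(m)
    )
    return hits / m


def v_star_statistic(first, second):
    """V*: proportion of occurrences over all pairs from each sample."""
    hits = sum(
        psi(a, b, c, d)
        for a, b in combinations(first, 2)
        for c, d in combinations(second, 2)
    )
    return hits / (comb(len(first), 2) * comb(len(second), 2))


if __name__ == "__main__":
    first, second = [1, 5, 2, 6], [3, 4, 7, 8]
    print(v_statistic(first, second), v_star_statistic(first, second))
    # Reordering the first sample may change V but leaves V* unchanged.
    print(v_statistic([1, 2, 5, 6], second), v_star_statistic([1, 2, 5, 6], second))
```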


2. MOST POWERFUL TESTS 


2.1. Uniformly Most Powerful Tests. Even in parametric theory it is 
only the simplest problems that admit a uniformly most powerful test. In 
this section we consider one such nonparametric problem—the problem of 
location. 

The problem of location was described in Section 2.2 of Chapter 3. Let
$X_1, \dots, X_n$ be independent and each have the same absolutely continuous
distribution on the real line with density function $f(x)$; the problem is

(2.1)   Hypothesis: $\xi_p(f(x)) = \xi_0$,
        Alternative: $\xi_p(f(x)) > \xi_0$,

where $\xi_p(f(x))$ is the $p$ percentile of the distribution as given by $f(x)$. For
notational simplicity we assume $\xi_0 = 0$.

Let $f(x)$ be a typical density of the alternative. We write

$$f(x) = p^* f^-(x) + q^* f^+(x), \tag{2.2}$$

where $f^-(x)$, $f^+(x)$ are positive only on the negative axis, the positive axis,
respectively, and $p^*$ is the probability to the left of the origin and $q^*$ the
probability to the right of the origin. (See Fig. 17.) Because of our choice
of $p^*$ and $q^*$, $f^-(x)$ and $f^+(x)$ will be density functions with probability only
on the negative and the positive axes, respectively. Since by assumption
$\xi_p(f(x)) \ge 0$, it follows that $p^* \le p$ and $q^* \ge q$ where $q = 1 - p$. If
$p = p^*$, then $\xi_p(f(x)) = 0$, and $f(x)$ can be considered a density of the
hypothesis. By assuming that $f(x)$ is strictly in the alternative, we obtain
$p^* < p$. We now calculate the test that is most powerful for this density
$f(x)$ of the alternative.

Figure 17. Diagrams of f(x) (top) and f(x) (bottom).

Following the method used in Section 3.3 of Chapter 2, we look for a
hypothesis distribution which most resembles $f(x)$, which would be most
difficult to test or distinguish from $f(x)$. Consider the density

$$f_0(x) = p f^-(x) + q f^+(x). \tag{2.3}$$

Obviously $\xi_p(f_0(x)) = 0$, and $f_0(x)$ belongs to the hypothesis. Also, it
resembles $f(x)$ in that the relative densities on the negative axis and on the
positive axes are the same as for $f(x)$.


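For a sample of $n$, we find the most
powerful test of the simple hypothesis that the density is $f_0(x)$ against the
simple alternative that it is $f(x)$; applying the fundamental lemma,
Theorem 3.1 in Chapter 2, we obtain

$$\varphi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \prod_{i=1}^{n} f(x_i) \Big/ \prod_{i=1}^{n} f_0(x_i) > c, \\ a & \text{if } \prod_{i=1}^{n} f(x_i) \Big/ \prod_{i=1}^{n} f_0(x_i) = c, \\ 0 & \text{if } \prod_{i=1}^{n} f(x_i) \Big/ \prod_{i=1}^{n} f_0(x_i) < c. \end{cases}$$

Defining $i(x_1, \ldots, x_n)$ as the number of positive numbers among $x_1, \ldots, x_n$, we can
write the probability ratio

$$\frac{\prod_{j=1}^{n} f(x_j)}{\prod_{j=1}^{n} f_0(x_j)} = \frac{\prod_{j=1}^{n} [p^* f_-(x_j) + q^* f_+(x_j)]}{\prod_{j=1}^{n} [p f_-(x_j) + q f_+(x_j)]} = \left(\frac{q^*}{q}\right)^{i} \left(\frac{p^*}{p}\right)^{n-i} = \left(\frac{p^*}{p}\right)^{n} \left(\frac{p q^*}{p^* q}\right)^{i}.$$

Since $pq^*/p^*q$ is greater than 1, the probability ratio is a monotone-
increasing function of $i(x_1, \ldots, x_n)$; hence the test function can be written

(2.4)    $$\varphi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } i(x_1, \ldots, x_n) > c, \\ a & \text{if } i(x_1, \ldots, x_n) = c, \\ 0 & \text{if } i(x_1, \ldots, x_n) < c, \end{cases}$$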


where $a$, $c$ are chosen to give the test exact size $\alpha$. Under the hypothesis
the induced distribution of $i(x_1, \ldots, x_n)$ is binomial with parameters $n$, $q$;
hence $c$ is the $\alpha$ point from the right-hand tail of this distribution. This
test is based on the signs of the $x_i$'s and is usually referred to as the sign test.

The test is similar for the full hypothesis and hence is a size-$\alpha$ test.
By the theory in Chapter 2 it is a most powerful test against $f(x)$. But the
test does not depend on the form of $f(x)$. Therefore, it is a uniformly most
powerful test for the problem (2.1). More generally, the sign test is based
on the number of positive signs among $x_1 - \xi_0, \ldots, x_n - \xi_0$.
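The choice of $c$ and $a$ from the right-hand binomial tail can be sketched as follows. This is an illustrative sketch only; the function names `sign_test_constants` and `sign_test` are hypothetical, and the randomization constant is computed so that the size is exactly $\alpha$ under a Binomial($n$, $q$) count of positive signs.

```python
from math import comb


def sign_test_constants(n, q, alpha):
    """Return (c, a): reject when the count of positive signs exceeds c,
    and reject with probability a when the count equals c.

    Under the hypothesis the count is Binomial(n, q).
    """
    pmf = [comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(n + 1)]
    tail = 0.0  # P(count > c), accumulated from the right-hand tail
    for c in range(n, -1, -1):
        if tail + pmf[c] > alpha:
            return c, (alpha - tail) / pmf[c]
        tail += pmf[c]
    return -1, 0.0  # alpha >= 1: every outcome is rejected


def sign_test(xs, xi0, p, alpha):
    """Rejection probability of the one-sided sign test that the p percentile
    equals xi0, against the alternative that it is larger."""
    count = sum(1 for x in xs if x - xi0 > 0)
    c, a = sign_test_constants(len(xs), 1 - p, alpha)
    if count > c:
        return 1.0
    return a if count == c else 0.0
```

With $n = 10$, $q = 1/2$ and $\alpha = 0.05$ this gives $c = 8$: the test rejects outright for 9 or 10 positive signs and randomizes at 8.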
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The sign test remains uniformly most powerful for the more general 
problem (2.5) in Chapter 3:

(2.5)    Hypothesis: $\xi_p(f_\theta(x)) \le \xi_0$,
         Alternative: $\xi_p(f_\theta(x)) > \xi_0$.

The hypothesis has been enlarged; the alternative is the same. With
respect to this enlarged hypothesis, the sign test is easily seen to remain of
size $\alpha$. For, given any one of the added distributions, it will have prob-
ability less than $q$ to the right of the origin. Hence the probability for
large values of $i(x_1, \ldots, x_n)$ is reduced; that is, the power function takes a
value less than $\alpha$. Then, by the theory of Section 3.3 in Chapter 2, the
test remains uniformly most powerful for (2.5).
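The claim that the rejection probability falls below $\alpha$ when the probability to the right of the origin is less than $q$ can be checked numerically. The sketch below is illustrative; `sign_test_power` is a hypothetical helper, and the constants $c = 8$, $a = 40.2/45$ in the usage note are the values for $n = 10$, $q = 1/2$, $\alpha = 0.05$.

```python
from math import comb


def sign_test_power(n, c, a, q_true):
    """Probability of rejection for the sign test with constants (c, a) when
    each observation is positive with probability q_true."""

    def pmf(k):
        return comb(n, k) * q_true**k * (1 - q_true) ** (n - k)

    return sum(pmf(k) for k in range(c + 1, n + 1)) + a * pmf(c)
```

Evaluating `sign_test_power(10, 8, 40.2 / 45, q)` over a grid of `q` values shows a function increasing in `q` that equals 0.05 at `q = 0.5`, which is the behaviour the argument above relies on.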

We mention another problem which has a uniformly most powerful test.
Let $X_1, \ldots, X_n$ be independent and each have the same distribution on $R^2$
with density $f_\theta(x^{(1)}, x^{(2)})$; $\theta$ indexes the absolutely continuous distributions.
Letting $G_\theta(z)$ stand for the induced distribution of $Z = X^{(2)} - X^{(1)}$, we
consider briefly the problem

(2.6)    Hypothesis: $\xi_{0.5}(G_\theta) \le 0$,
         Alternative: $\xi_{0.5}(G_\theta) > 0$.

This problem is concerned with the median of the random variable
$Z = X^{(2)} - X^{(1)}$ where $X = (X^{(1)}, X^{(2)})$ has the density $f_\theta(x^{(1)}, x^{(2)})$.

This problem could arise when, say, two treatments were applied to two
"plots" in a block and a series of repetitions of this basic experiment were
made. The hypothesis and alternative (2.6) represent one attempt to
describe treatment 2 being, respectively, no better and better than
treatment 1.

By following the same methods, it is possible to show that the sign test
based on the differences $x_1^{(2)} - x_1^{(1)}, \ldots, x_n^{(2)} - x_n^{(1)}$ is uniformly most
powerful.


▶ For the problems above it is possible to find a generalized sufficient statistic
as defined at the end of Section 5 in Chapter 1. For example, consider the first
problem. We can represent any density function $f(x)$ by

(2.7)    $f(x) = p_f f_-(x) + q_f f_+(x)$,

where $p_f$, $q_f$ are, respectively, the probability to the left and the probability to
the right of the origin, and $f_-(x)$, $f_+(x)$ are the probability densities for the
negative and the positive axes. Thus the density $f(x)$ can be equivalently described
by three parameters $p_f$, $f_-(x)$, $f_+(x)$. Also, by taking any $p_f$, $f_-(x)$, $f_+(x)$ and
substituting in (2.7), a density for the problem will be obtained. The parameter
space is then the product space of $[0, 1]$ for $p_f$, the space of densities
on the negative axis, and the space of densities on the positive axis. It is straight-
forward to show that $i(x_1, \ldots, x_n)$ is a sufficient statistic $(p_f)$ for the problem.
Then, since the hypothesis and alternative can be expressed in terms of the
parameter $p_f$, we can apply Theorem 3.4 and obtain from the proof of that
theorem the uniformly most powerful test: the sign test.

For the two-sided location parameter problem (2.6) in Chapter 3, a uniformly 
most powerful unbiased test exists. It is a sign test and is to reject for large or 
small values of the number of positive signs where the two critical values are 
chosen to make the test unbiased of the proper size. For the derivation see 
pages 59 and 60 in [2]. ◀


2.2. Most Powerful Tests for Simple Alternatives. Most of the hypo- 
thesis testing problems outlined in Chapter 3 are analogs of standard 
problems involving normal distributions. The nonparametric formula- 
tion may be preferred in cases where the statistician feels that the distribu- 
tions are close to normality but where he wishes to protect himself by 
keeping the size of the test valid for the more general problem. In such a 
case it is reasonable to look among the nonparametric tests for ones that 
have maximum power for the alternatives which have an underlying 
normal distribution. In this section we consider a method of obtaining 
tests of maximum power for simple alternative hypotheses. 

For most nonparametric problems we have no way of handling the 
full class of tests of a given size in order to find the test having maximum 
power. However, for some problems we can characterize rather simply 
those tests that are similar. We now discuss a technique outlined in 
Section 3.5 of Chapter 2. 

Let the class of probability measures be $\{P_\theta \mid \theta \in \Omega\}$ over $\mathcal{X}(\mathcal{A})$ and the
problem be

(2.8)    Hypothesis: $\theta \in \omega$,
         Alternative: $\theta \in \Omega - \omega$.

For this problem we consider a simple alternative:

(2.9)    Alternative*: $\theta = \theta^*$,

where $\theta^* \in \Omega - \omega$. If the problem possesses a statistic $t(x)$ which is both
sufficient and boundedly complete for the measures of the hypothesis
$\{P_\theta \mid \theta \in \omega\}$, then it is possible to describe simply the form of a similar
size-$\alpha$ test. It must have conditional size $\alpha$ with respect to the statistic
$t(x)$. That is, if $\varphi(x)$ is a similar test of size $\alpha$, then

(2.10)    $E_\omega\{\varphi(X) \mid t(x) = t\} = \alpha$

for almost all $\{P_\theta^T \mid \theta \in \omega\}$ values of $t$. The expectation has a subscript $\omega$
to indicate that the conditional measure is that of the hypothesis, there
being only one because $t(x)$ was sufficient for $\theta \in \omega$. Thus, to examine
similar tests of size $\alpha$ is to examine, for almost all $\{P_\theta^T \mid \theta \in \omega\}$ values of $t$,
those tests that have size $\alpha$ in each of the subspaces of $\mathcal{X}$ given by $t(x) = t$.
Of course, there is complete freedom of choice of test for values of $t$ in any
set having $P_\theta^T$ measure zero for each $\theta \in \omega$. Because $t(x)$ is a sufficient
statistic, there is only one hypothesis distribution over each subspace, and
hence in the subspace the hypothesis is simple. Thus, to find a most
powerful similar test for $\theta = \theta^*$, we can apply the fundamental lemma to
the simple hypothesis and alternative in each subspace. Of course, for
values of $t$ in a set having zero probability for each $\theta$ in the hypothesis but
positive probability for $\theta = \theta^*$, we would choose $\varphi(x) = 1$ to maximize the
power. We summarize these ideas in the following theorem.


THEOREM 2.1. If $t(x)$ is sufficient and boundedly complete for the
measures of the hypothesis, $\theta \in \omega$, then any similar test $\varphi(x)$ of size $\alpha$ has
for almost all $\{P_\theta^T \mid \theta \in \omega\}$ values of $t$ conditional size $\alpha$, given $t(x) = t$.
The most powerful similar test against $\theta = \theta^*$ is obtained by finding for
each $t$ the most powerful size-$\alpha$ test $\varphi(x \mid t)$ of $P_\omega(A \mid t)$ against $P_{\theta^*}(A \mid t)$ but
setting $\varphi(x \mid t) = 1$ for $t$ in the set $B^*$ that has maximum $P_{\theta^*}^T$ probability,
yet zero $P_\theta^T$ probability for $\theta \in \omega$.


Proof. Let $\mathcal{B}$ be the $\sigma$-algebra on the space of values of the statistic
$t(x)$, and define $\mathcal{B}^*$ to consist of sets $B$ for which

(2.11)    $P_{\theta^*}^T(B) > 0$

and

          $P_\theta^T(B) = 0$

for all $\theta \in \omega$. [...] would also belong to $\mathcal{B}^*$ [...]. Hence the $B^*$ as required exists.
The remainder of the proof follows from Theorem 3.5 in Section 3.5.


In a few nonparametric problems the method above gives a most
powerful test without the restriction of similarity. For this we need the
following extension of the concept of completeness.




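A class of measures $\{P_\theta^T \mid \theta \in \omega\}$ is totally complete if

(2.12)    $E_\theta\{f(T)\} = \int f(t)\, dP_\theta^T(t) \le 0$

for all $\theta \in \omega$, where $f(t)$ is a real-valued statistic, implies that $f(t) \le 0$
almost everywhere $\{P_\theta^T\}$.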


Frequently it will be convenient to say that a statistic is totally complete. 
When we do this it will be for a particular problem in which the statistic 
has a given class of measures, and we shall mean that the class of measures 
is totally complete. As with bounded completeness we have a theorem 
on the conditional size of tests, given a sufficient statistic which is totally 
complete. 


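THEOREM 2.2. If $t(x)$ is a sufficient and totally complete statistic for
$\{P_\theta \mid \theta \in \omega\}$ over $\mathcal{X}(\mathcal{A})$, then a size-$\alpha$ test for $\theta \in \omega$ has for almost all
$\{P_\theta^T \mid \theta \in \omega\}$ values of $t$ conditional size $\alpha$, given $t(x) = t$.


Proof. Let $\varphi(x)$ be a size-$\alpha$ test. Then
$$E_\theta\{\varphi(X)\} \le \alpha$$
for $\theta \in \omega$. But this implies
$$E_\theta\{E_\omega\{\varphi(X) \mid t(x) = T\} - \alpha\} \le 0$$
for $\theta \in \omega$. Total completeness then gives
$$E_\omega\{\varphi(X) \mid t(x) = t\} \le \alpha$$
for almost all $\{P_\theta^T \mid \theta \in \omega\}$ values of $t$. This is the required result.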


Thus, if we are interested in the size-$\alpha$ tests for the hypothesis, $\theta \in \omega$,
and have a statistic sufficient and totally complete for $\theta \in \omega$, we can
construct tests having conditional size $\alpha$ in each subspace, given $t(x) = t$.
By the same type of argument used for Theorem 2.1, we obtain


THEOREM 2.3. If $t(x)$ is sufficient and totally complete for $\theta \in \omega$, then
the most powerful size-$\alpha$ test against $\theta = \theta^*$ is obtained by finding for each
$t$ the most powerful size-$\alpha$ test $\varphi(x \mid t)$ of $P_\omega(A \mid t)$ against $P_{\theta^*}(A \mid t)$ but
setting $\varphi(x \mid t) = 1$ for $t$ in the set $B^*$ having maximum $P_{\theta^*}^T$ probability, yet
zero $P_\theta^T$ probability for $\theta \in \omega$.

In applying the above theory we need a statistic t(x) for which the condi- 


tional distribution, given the statistic, is the same for all probability 
measures of the hypothesis. This is the requirement of sufficiency. The 


174 THE THEORY OF HYPOTHESIS TESTING [5.2 


condition of total completeness would be satisfied if the induced distribu-
tions of the statistic $t(x)$ under the hypothesis were all probability distri-
butions for the statistic. This is proved in the following theorem.


THEOREM 2.4. If the class $\{P_\theta^T \mid \theta \in \omega\}$ is the class of all probability
measures over $\mathcal{T}(\mathcal{B})$, then $\{P_\theta^T \mid \theta \in \omega\}$ is totally complete.


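Proof. Let $f(t)$ be a statistic over $\mathcal{T}(\mathcal{B})$ for which

(2.13)    $E_\theta\{f(T)\} \le 0$

for $\theta \in \omega$, and define $B_f$ by

(2.14)    $B_f = \{t \mid f(t) > 0\}$.

Let $P_\theta^T$ be any measure in $\{P_\theta^T \mid \theta \in \omega\}$, and assume that $P_\theta^T(B_f) > 0$. Then
it is easily seen that the measure

(2.15)    $\mu(B) = \dfrac{1}{P_\theta^T(B_f)} \int_B \varphi_{B_f}(t)\, dP_\theta^T(t)$

is a probability measure and hence also belongs to $\{P_\theta^T \mid \theta \in \omega\}$; here $\varphi_{B_f}(t)$
is the characteristic function of the set $B_f$.
But from (2.14) we have $f(t) > 0$ on $B_f$, and this, together with our assumption
that $P_\theta^T(B_f) > 0$, implies

(2.16)    $\int f(t)\, d\mu(t) = \dfrac{1}{P_\theta^T(B_f)} \int_{B_f} f(t)\, dP_\theta^T(t) > 0$.

(2.13), applied to the measure $\mu$, and (2.16) provide a contradiction; hence our
assumption was incorrect, and $P_\theta^T(B_f) = 0$. This means that the probability measure of
$\{t \mid f(t) > 0\}$ is zero for all measures in $\{P_\theta^T \mid \theta \in \omega\}$ and therefore
establishes total completeness.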


▶ The following more general theorem can be proved in much the same way.

THEOREM 2.5. If each measure in the class of probability measures $\{P_\theta^T \mid \theta \in \omega\}$
is dominated by a measure in $\{\mu_\eta \mid \eta \in H\}$, and if $\{P_\theta^T \mid \theta \in \omega\}$ contains at least the
uniform distributions with respect to each $\mu_\eta$ over the sets $B$ of a basis $\mathcal{B}_0$ of $\mathcal{B}$, then
$\{P_\theta^T \mid \theta \in \omega\}$ is totally complete.

Note. For the definition of a basis see Section 7 of Chapter 1.

Note. By the uniform distribution over the set $B^*$ with respect to the measure $\mu_\eta$
we mean the probability measure

$$P_{B^*}(B) = \frac{1}{\mu_\eta(B^*)} \int_{B \cap B^*} d\mu_\eta(t).$$ ◀
EXAMPLE 2.1. THE TWO-SAMPLE PROBLEM. Let $X_1, \ldots, X_{n_1}$ be
independent and each $X_i$ have an absolutely continuous distribution over
$R^1$ with density $f_{\theta_1}(x)$. Similarly let $X_{n_1+1}, \ldots, X_{n_1+n_2}$ be independent and
each $X_{n_1+j}$ have an absolutely continuous distribution over $R^1$ with density
$f_{\theta_2}(x)$. $\theta \in \Omega$ indexes the density functions over $R^1$. Consider the problem
of testing the

(2.17)    Hypothesis: $f_{\theta_1}(x) = f_{\theta_2}(x)$, $\theta_i \in \Omega$,

against the simple

(2.18)    Alternative: $f_{\theta_1}(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{1}{2\sigma^2}(x - \mu_1)^2\right]$,
                       $f_{\theta_2}(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{1}{2\sigma^2}(x - \mu_2)^2\right]$.

Without loss of generality, suppose $\mu_1 > \mu_2$.


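Under the distributions of the hypothesis we have seen in Section 2 of
Chapter 4 that the order statistic $t(x) = \{x_1, \ldots, x_{n_1+n_2}\}$ is a sufficient
statistic. By Theorems 7.1 and 6.1 in Chapter 1 we have that $t(x)$ is
complete. Thus the assumptions of Theorem 2.1 are satisfied. We now
consider the construction of a size-$\alpha$ test in a typical subspace $t(x) = t$.

Under the hypothesis the conditional distribution of the outcome
$(x_1, \ldots, x_{n_1+n_2})$, given $t(x)$, is equal probability $1/(n_1 + n_2)!$ to each
permutation of the set of numbers in $t(x)$. The conditional distribution
under the alternative can be derived in the same manner as the hypothesis
conditional distribution was derived in Section 2 of Chapter 4. The result
is that the probability to each permutation of the set of numbers in $t(x)$ is
proportional to the values of the alternative density function at these
points; that is, proportional to

$$p(x) = \prod_{i=1}^{n_1} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x_i - \mu_1)^2\right] \prod_{j=1}^{n_2} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(x_{n_1+j} - \mu_2)^2\right].$$

We apply the fundamental lemma, Theorem 3.1 in Chapter 2, and obtain

$$\varphi(x) = \begin{cases} 1 & \text{if } \dfrac{p(x)}{1/(n_1+n_2)!} > c, \\ a & \text{if } \dfrac{p(x)}{1/(n_1+n_2)!} = c, \\ 0 & \text{if } \dfrac{p(x)}{1/(n_1+n_2)!} < c. \end{cases}$$

Now, by remembering that there are at most $(n_1 + n_2)!$ points in the set,
given $t(x)$, and that each of these points has the same set of coordinates
which are just arrangements of the values in $t(x)$, we can see that, as a
function of $x$, given $t(x)$, each succeeding expression below is a monotone-
increasing function of the previous expression:

$$\frac{p(x)}{1/(n_1+n_2)!},$$

$$\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \frac{1}{2\sigma^2}\sum_{j=1}^{n_2}(x_{n_1+j} - \mu_2)^2\right],$$

$$-\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \sum_{j=1}^{n_2}(x_{n_1+j} - \mu_2)^2,$$

$$\mu_1 \sum_{i=1}^{n_1} x_i + \mu_2 \sum_{j=1}^{n_2} x_{n_1+j},$$

$$(\mu_1 - \mu_2) \sum_{i=1}^{n_1} x_i + \mu_2 \sum_{k=1}^{n_1+n_2} x_k,$$

$$(\mu_1 - \mu_2) \sum_{i=1}^{n_1} x_i,$$

$$\sum_{i=1}^{n_1} x_i.$$

The test can therefore be written

$$\varphi(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n_1} x_i > c, \\ a & \text{if } \sum_{i=1}^{n_1} x_i = c, \\ 0 & \text{if } \sum_{i=1}^{n_1} x_i < c, \end{cases}$$

where $c$, $a$ are constants to be chosen to give the test size $\alpha$. If we consider
the test as being over $R^{n_1+n_2}$ and not just in the subspace, given $t(x)$, then $c$, $a$
will be functions of $t(x)$.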

Using the alternative definition of the order statistic, $t'(x) = (x_{(1)}, \ldots, x_{(n_1+n_2)})$,
we can describe quite simply how the choice of $c$ and $a$ is made.

Under the hypothesis, the conditional distribution of $(x_1, \ldots, x_{n_1+n_2})$,
given $(x_{(1)}, \ldots, x_{(n_1+n_2)})$, is equal probability to each permutation of
$(x_{(1)}, \ldots, x_{(n_1+n_2)})$. For each such permutation we calculate $\sum_1^{n_1} x_i$. $c$ is
the smallest number for which the proportion of permutations with
$\sum_1^{n_1} x_i > c$ is at most $\alpha$. $a$ is the probability of rejecting when $\sum_1^{n_1} x_i = c$ and is chosen
to bring the test up to exact size $\alpha$. By Theorem 2.1 this is the most power-
ful similar test against the simple alternative (2.18). But the test does not
depend on $\sigma^2$, $\mu_1$, $\mu_2$ provided $\mu_1 > \mu_2$; therefore it is the similar test
uniformly most powerful against the normal

         Alternative: $f_{\theta_1}(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{1}{2\sigma^2}(x - \mu_1)^2\right]$,
                      $f_{\theta_2}(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{1}{2\sigma^2}(x - \mu_2)^2\right]$.
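The determination of $c$ and $a$ from the permutation distribution of the first-sample sum can be sketched in code. This is an illustration under stated assumptions, not the author's algorithm: the name `pitman_one_sided` is hypothetical, and the enumeration runs over subsets of size $n_1$ rather than all $(n_1 + n_2)!$ permutations, which gives the same distribution because each subset corresponds to the same number of permutations.

```python
from itertools import combinations


def pitman_one_sided(first, second, alpha):
    """Return (rejection probability, c, a) for the one-sided permutation test
    that rejects for large values of the first-sample sum."""
    pooled = list(first) + list(second)
    n1 = len(first)
    # One sum per choice of which pooled values form the first sample.
    sums = [sum(s) for s in combinations(pooled, n1)]
    total = len(sums)
    # c: smallest attained value whose strict upper tail has proportion <= alpha.
    c = min(
        v for v in set(sums)
        if sum(1 for s in sums if s > v) <= alpha * total
    )
    greater = sum(1 for s in sums if s > c)
    equal = sum(1 for s in sums if s == c)
    a = (alpha * total - greater) / equal  # randomization at the boundary
    observed = sum(first)
    if observed > c:
        return 1.0, c, a
    return (a if observed == c else 0.0), c, a
```

Exact equality of sums is used to detect the boundary, so integer or exactly representable data are assumed; with general floating-point data a tolerance would be needed. The full enumeration is practical only for small samples.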


For the two-sided normal alternative of common variance and means
$\mu_1$, $\mu_2$ with $\mu_1 \ne \mu_2$, the test that rejects for large values of the statistic
$\left| n_2 \sum_1^{n_1} x_i - n_1 \sum_1^{n_2} x_{n_1+j} \right|$ under permutations of the order statistic is a
most stringent similar test and has minimax risk with respect to any loss
function that depends on $\sigma^2$ and $\mu_1 - \mu_2$. This is obtained by using
Theorems 3.9, 3.10, and 3.14 in Chapter 2.

The one- and two-sided tests described above can be exhibited in another
form. Let $s(x)$ be the usual two-sample Student statistic for use when the
variances are assumed equal:

$$s(x) = \frac{\bar{x} - \bar{x}'}{\left[\dfrac{n_1 + n_2}{n_1 n_2 (n_1 + n_2 - 2)} \left(\sum_{i=1}^{n_1} (x_i - \bar{x})^2 + \sum_{j=1}^{n_2} (x_{n_1+j} - \bar{x}')^2\right)\right]^{1/2}},$$

where

$$\bar{x} = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i, \qquad \bar{x}' = \frac{1}{n_2} \sum_{j=1}^{n_2} x_{n_1+j}.$$

The tests are then to reject for large values, respectively, of $s(x)$ and of
$|s(x)|$ under the $(n_1 + n_2)!$ permutations of the coordinates of the order
statistic $t(x)$. For this it is straightforward to show that $s(x)$ is a monotone-
increasing function of $n_2 \sum_1^{n_1} x_i - n_1 \sum_1^{n_2} x_{n_1+j}$ for given $t(x) = t$.
These two tests are usually referred to as the one- and two-sided Pitman
tests. The one-sided Pitman test is considered in Problem 1 in the previous
section.
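The monotone relation between the Student statistic and the first-sample sum, for a fixed pooled set of values, can be checked numerically. The sketch below is illustrative; `student_statistic` and `student_is_monotone_in_sum` are hypothetical names, and the check assumes no split of the pooled values makes both samples constant (which would give a zero denominator).

```python
from itertools import combinations
from math import sqrt


def student_statistic(first, second):
    """Two-sample Student statistic with pooled variance."""
    n1, n2 = len(first), len(second)
    m1, m2 = sum(first) / n1, sum(second) / n2
    ss = sum((x - m1) ** 2 for x in first) + sum((x - m2) ** 2 for x in second)
    return (m1 - m2) / sqrt((n1 + n2) / (n1 * n2 * (n1 + n2 - 2)) * ss)


def student_is_monotone_in_sum(pooled, n1):
    """Check that, over all splits of the pooled values into samples of sizes
    n1 and len(pooled) - n1, the Student statistic is ordered in the same way
    as the sum of the first sample."""
    pairs = []
    for idx in combinations(range(len(pooled)), n1):
        first = [pooled[i] for i in idx]
        second = [pooled[i] for i in range(len(pooled)) if i not in idx]
        pairs.append((sum(first), student_statistic(first, second)))
    pairs.sort()
    return all(s1 <= s2 + 1e-9 for (_, s1), (_, s2) in zip(pairs, pairs[1:]))
```

Since the ordering of the two statistics agrees on every split, rejecting for large values of either one under the permutation distribution yields the same test.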


EXAMPLE 2.2. PROBLEM OF INDEPENDENCE (2.4) IN CHAPTER 3. Let
$X_1, \ldots, X_n$ be independent and each have the same distribution function
$F_\theta(x^{(1)}, x^{(2)})$, where $\theta$ indexes the absolutely continuous distributions on
$R^2$. Consider the independence

(2.19)    Hypothesis: $F_\theta(x^{(1)}, x^{(2)}) = F_\theta^{(1)}(x^{(1)})\, F_\theta^{(2)}(x^{(2)})$,

against the simple

(2.20)    Alternative: $F_\theta(x^{(1)}, x^{(2)})$ is a bivariate normal with parameters
          $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, $\rho$.

Under the hypothesis there is a sample of $n$ from $F_\theta^{(1)}$ and an independent
sample of $n$ from $F_\theta^{(2)}$. By Example 2.4 in Chapter 4 we have the complete
and sufficient statistic $t(x) = (\{x_1^{(1)}, \ldots, x_n^{(1)}\}, \{x_1^{(2)}, \ldots, x_n^{(2)}\})$ which is just
the combination of the order statistics for the two samples. We construct
a most powerful test in the subspace, given this statistic $t(x)$.
The fundamental lemma gives a test based on the probability ratio:
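$$c \exp\left\{-\frac{1}{2(1-\rho^2)} \left[\sum \frac{(x_i^{(1)} - \mu_1)^2}{\sigma_1^2} - 2\rho \sum \frac{(x_i^{(1)} - \mu_1)(x_i^{(2)} - \mu_2)}{\sigma_1 \sigma_2} + \sum \frac{(x_i^{(2)} - \mu_2)^2}{\sigma_2^2}\right]\right\}.$$

It is easily shown that, for given $t(x)$, each succeeding expression below is
a monotone-increasing function of the previous expression:

$$-\frac{1}{2(1-\rho^2)} \left[\sum \frac{(x_i^{(1)} - \mu_1)^2}{\sigma_1^2} - 2\rho \sum \frac{(x_i^{(1)} - \mu_1)(x_i^{(2)} - \mu_2)}{\sigma_1 \sigma_2} + \sum \frac{(x_i^{(2)} - \mu_2)^2}{\sigma_2^2}\right],$$

$$\operatorname{sign} \rho \cdot \sum (x_i^{(1)} - \mu_1)(x_i^{(2)} - \mu_2),$$

$$\operatorname{sign} \rho \cdot \sum x_i^{(1)} x_i^{(2)},$$

$$\operatorname{sign} \rho \cdot \sum (x_i^{(1)} - \bar{x}^{(1)})(x_i^{(2)} - \bar{x}^{(2)}),$$

$$\operatorname{sign} \rho \cdot \frac{\sum (x_i^{(1)} - \bar{x}^{(1)})(x_i^{(2)} - \bar{x}^{(2)})}{\left[\sum (x_i^{(1)} - \bar{x}^{(1)})^2 \sum (x_i^{(2)} - \bar{x}^{(2)})^2\right]^{1/2}},$$

$$\operatorname{sign} \rho \cdot r.$$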




$r$ is the usual correlation statistic between $x^{(1)}$ and $x^{(2)}$. When $\rho > 0$, the
test function can be written

$$\varphi(x) = \begin{cases} 1 & \text{if } \sum x_i^{(1)} x_i^{(2)} > c, \\ a & \text{if } \sum x_i^{(1)} x_i^{(2)} = c, \\ 0 & \text{if } \sum x_i^{(1)} x_i^{(2)} < c. \end{cases}$$

The constants $c = c(t(x))$, $a = a(t(x))$ are chosen to give the test size $\alpha$
under the $n!$ equally likely pairings of $x^{(1)}$ values with $x^{(2)}$ values. This
test does not depend on $\sigma_1$, $\sigma_2$, or $\rho$ when $\rho > 0$; it is most powerful
similar for the hypothesis of independence against the composite normal

(2.21)    Alternative: $F_\theta(x^{(1)}, x^{(2)})$ is the normal bivariate distri-
          bution with variances $\sigma_1^2$, $\sigma_2^2$, and correla-
          tion $\rho$; $\sigma_1^2, \sigma_2^2 \in\ ]0, \infty[$, $\rho \in\ ]0, 1[$.

The test can also be based on the statistic $r$, the sample correlation
coefficient.

For the two-sided normal alternative with $\rho \ne 0$, the conditional test,
given $t(x)$, which rejects for large values of $\left| \sum (x_i^{(1)} - \bar{x}^{(1)})(x_i^{(2)} - \bar{x}^{(2)}) \right|$
or of $|r|$ is most stringent similar.
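The conditional test for $\rho > 0$ can be sketched by enumerating the $n!$ pairings. This is an illustrative sketch; the name `correlation_permutation_test` is hypothetical, the statistic used is the sum of products, and exact equality is used at the boundary, so exactly representable data are assumed.

```python
from itertools import permutations


def correlation_permutation_test(x1, x2, alpha):
    """Return (rejection probability, c, a) for the one-sided test that rejects
    for large sum of products under the n! equally likely pairings."""
    stats = [sum(u * v for u, v in zip(x1, perm)) for perm in permutations(x2)]
    total = len(stats)
    # c: smallest attained value whose strict upper tail has proportion <= alpha.
    c = min(
        v for v in set(stats)
        if sum(1 for s in stats if s > v) <= alpha * total
    )
    greater = sum(1 for s in stats if s > c)
    equal = sum(1 for s in stats if s == c)
    a = (alpha * total - greater) / equal
    observed = sum(u * v for u, v in zip(x1, x2))
    if observed > c:
        return 1.0, c, a
    return (a if observed == c else 0.0), c, a
```

Because the sample correlation coefficient is a monotone-increasing function of the sum of products for a fixed pair of order statistics, the same rejection region results if `r` is used instead. Enumeration of all pairings is feasible only for small `n`.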


EXAMPLE 2.3. THE ONE-SIDED LOCATION PROBLEM. We consider the
general linear hypothesis formulation (4.7) in Chapter 3 as it applies to
the one-sided location problem. Let $X = (X_1, \ldots, X_n)$ have a distribu-
tion over $R^n$ with density $f_\theta(x_1, \ldots, x_n)$; $\theta$, belonging to $\Omega$, indexes the
absolutely continuous distributions over $R^n$. We consider the problem of
testing the

         Hypothesis: $f_\theta(x_1, \ldots, x_n) = h_\theta(x_1^2 + \cdots + x_n^2)$, $\theta \in \Omega$,

against the simple

         Alternative: $X_1, \ldots, X_n$ are independent, and each is
         normal with mean $\mu > 0$ and variance $\sigma^2$.

The statistic $t(x) = \sum_1^n x_i^2$ is sufficient under the hypothesis. For it can
be easily shown that the conditional probability measure over the sphere
$\sum_1^n x_i^2 = t$ is the uniform probability measure; that is, the measure of a
set is proportional to the 'area' of the set on the sphere. Also, under the
hypothesis the statistic $t(x)$ has an arbitrary absolutely continuous
distribution. This can be seen by noting that an arbitrary absolutely
continuous distribution for $t(x)$ combined with the uniform probability
measure over the sphere $t(x) = t$ will produce a probability measure of our
hypothesis. Then, by Theorem 2.4 in this section, $t(x)$ is totally complete.

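Now, applying Theorem 2.3 in this section, we find the most powerful
test by finding the most powerful test in each subspace $t(x) = t$. The test
obtained from the fundamental lemma is based on the following probability
ratio:

$$c\, \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2\right].$$

But, given the value of the statistic $t(x) = \sum x_i^2$, each succeeding expression
below is a monotone-increasing function of the previous expression:

$$c' \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2\right],$$

$$-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2,$$

$$\sum x_i,$$

$$\frac{\sum x_i}{\left(\sum x_i^2\right)^{1/2}},$$

$$\frac{\sqrt{n}\, \bar{x}}{\left[\dfrac{1}{n-1} \sum (x_i - \bar{x})^2\right]^{1/2}}.$$

If we designate this last expression by $s(x)$, we find the conditional test to
be

$$\varphi(x) = \begin{cases} 1 & \text{if } s(x) > c, \\ a & \text{if } s(x) = c, \\ 0 & \text{if } s(x) < c. \end{cases}$$

Under the hypothesis the conditional distribution of $s(x)$ does not
depend on the subspace $t(x) = t$ and is in fact the Student
distribution with $n - 1$ degrees of freedom. If we let $s_\alpha$ be the value
exceeded with probability $\alpha$ according to this Student distribution, our
test becomes

$$\varphi(x) = \begin{cases} 1 & \text{if } s(x) > s_\alpha, \\ 0 & \text{if } s(x) < s_\alpha. \end{cases}$$

In this case the constant $s_\alpha$ obtained from the fundamental lemma does
not depend on the statistic $t(x)$. This test is the most powerful size-$\alpha$ test
against the normal alternative with $\mu > 0$.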
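The final monotonicity step, from $\sum x_i$ to the Student statistic on a fixed sphere, can be verified directly. The worked equation below is a sketch; the symbol $u$ for $\sum x_i$ is notation introduced here only for the calculation.

```latex
% On the sphere \sum x_i^2 = t, write u = \sum x_i (u is introduced for this check only).
\[
  \sum_{i=1}^{n} (x_i - \bar{x})^2 = t - \frac{u^2}{n},
  \qquad
  s(x) = \frac{\sqrt{n}\,\bar{x}}
              {\bigl[\tfrac{1}{n-1}\sum (x_i - \bar{x})^2\bigr]^{1/2}}
       = \frac{u\sqrt{n-1}}{\sqrt{nt - u^2}},
\]
\[
  \frac{\partial s}{\partial u}
    = \frac{\sqrt{n-1}\; nt}{(nt - u^2)^{3/2}} > 0
  \quad \text{for } u^2 < nt .
\]
```

Since $u^2 \le nt$ always holds by the Cauchy-Schwarz inequality, $s(x)$ is increasing in $\sum x_i$ wherever it is defined, which is the property used to pass from the probability ratio to the Student statistic.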
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As in Example 2.1, we obtain that the most stringent test against the 
normal alternative $\mu \ne 0$ is the two-sided $t$ test.

If we add to the assumptions of the hypothesis that the $X_i$'s are inde-
pendent, then it follows from probability theory that the $X_i$'s are identically
distributed according to a normal distribution. This is the 'parametric'
location problem as given in normal theory. Example 3.5 in Chapter 2
applied the theory we have summarized in Theorem 2.1 to prove that the
one-sided $t$ test above was most powerful similar for that parametric
problem.


3. MOST POWERFUL RANK TESTS. AN APPLICATION 
OF THE INVARIANCE METHOD 


3.1. Introduction. Invariance theory for hypothesis testing was intro- 
duced in Section 3.7 of Chapter 2. We summarize briefly the ideas 
involved. Suppose there are transformations which can be applied to the 
outcome and which produce a transformed random variable having as its 
probability measures the given measures of the problem. If, in addition, 
the transform of a random variable always has its measure in the hypo- 
thesis or alternative, according as the random variable itself represents the 
hypothesis or alternative, then the transformations leave the hypothesis 
testing problem unchanged. The invariance principle then requires that 
attention be restricted to those test functions that are invariant under any 
of the transformations. In Section 3.2 we shall consider the application 
of the invariance method to nonparametric problems, treating in detail 
the problem of randomness. For a number of nonparametric problems 
the invariant tests are the tests based on ranks. In Section 3.3 we shall 
consider the use of ranks for a general type of nonparametric problem and 
discuss how to obtain rank tests most powerful for simple and composite 
alternatives. 


3.2. The Invariance Method for Randomness and Other Problems. We 
consider the invariance method for the problem of randomness and 
indicate its application to the problem of independence. 

Let $X_1, \ldots, X_n$ be independent, and let $X_i$ have a distribution on $R^1$
with a continuous distribution function $F_{\theta_i}(x)$ where $\theta_i \in \Omega$ indexes the
class of continuous distribution functions on the real line. The general
problem of randomness is given by

         Hypothesis: $\theta_1 = \cdots = \theta_n$,
         Alternative: Not all $\theta_i$ equal; $\theta_1, \ldots, \theta_n \in \Omega$.

The special forms of the problem of randomness are obtained by substituting
more restrictive alternatives.

We first define a class $\mathcal{S}$ of transformations of the real line into itself.
As a typical transformation consider $s$, a strictly increasing continuous
function. Since $s$ is strictly increasing, the inverse function $s^{-1}$ is at most
single valued, and since, in addition, $s$ is continuous, $s^{-1}$ is defined every-
where. Then it follows easily that $s^{-1}$ is strictly increasing and continuous.
Similarly, if $s_1$ and $s_2$ belong to $\mathcal{S}$, the product transformation $s_1 s_2$ is
easily shown to be strictly increasing and continuous. The closure of $\mathcal{S}$
under multiplication and inverse then implies that $\mathcal{S}$ is a group.

If $X$ is a random variable with the continuous distribution function $F_\theta$,
then the random variable $sX$ has a distribution function, say $F_{s\theta}$, given by
the relation

(3.1)    $F_{s\theta}(x) = \Pr_\theta\{sX \le x\} = \Pr_\theta\{X \le s^{-1}x\} = F_\theta(s^{-1}x)$.

Now, since $F_\theta$ and $s^{-1}$ are continuous, it follows that $F_\theta(s^{-1}x)$ is a continuous
function of $x$. Hence, if $\theta \in \Omega$ and $s \in \mathcal{S}$, then $s\theta \in \Omega$.

We now define for the problem of randomness a class $\mathcal{S}_n$ of transforma-
tions on the sample space $R^n$. A typical transformation $\tilde{s}$ in $\mathcal{S}_n$ is given by

(3.2)    $\tilde{s}(x_1, \ldots, x_n) = (sx_1, \ldots, sx_n)$,

where $s$ is a transformation in $\mathcal{S}$. $\tilde{s}$ applies the same transformation $s$
to each coordinate of the outcome. Obviously the class $\mathcal{S}_n$ is a group.

We now show that the transformations in $\mathcal{S}_n$ leave the problem of
randomness unchanged. If we designate by $\bar{s}$ the transformation on the
parameter space $\Omega^n$ corresponding to the transformation $\tilde{s}$ on $R^n$, then we
have

(3.3)    $\bar{s}(\theta_1, \ldots, \theta_n) = (s\theta_1, \ldots, s\theta_n)$.

If $(\theta_1, \ldots, \theta_n)$ belongs to the hypothesis, then $\theta_1 = \cdots = \theta_n$. Then, for
$\bar{s}(\theta_1, \ldots, \theta_n)$, we have $s\theta_1 = \cdots = s\theta_n$, and hence $\bar{s}(\theta_1, \ldots, \theta_n)$ also
belongs to the hypothesis. Also, if $(\theta_1, \ldots, \theta_n)$ belongs to the alternative,
then, for some $i$, $j$, $\theta_i \ne \theta_j$. Of course we use different $\theta$'s to designate
different distribution functions; hence $F_{\theta_i}(x) \ne F_{\theta_j}(x)$ for some $x$. This
immediately implies that $s\theta_i \ne s\theta_j$, and hence that $\bar{s}(\theta_1, \ldots, \theta_n)$ belongs to
the alternative.

We now remove from the sample space a set $S$ having zero measure for the prob-
ability measures of the problem:

(3.4)    $S = \{(x_1, \ldots, x_n) \mid x_i = x_j \text{ for some } i, j \text{ with } i \ne j\}$.
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Obviously there is a transformation in Y,, which transforms the point 
$(x_1, \ldots, x_n)$ into the point $(x_1', \ldots, x_n')$, provided the relative order of the
magnitudes of the coordinates is the same, that is provided that, if
$x_{i_1} < \cdots < x_{i_n}$, then $x_{i_1}' < \cdots < x_{i_n}'$. Alternatively, any transformation
in $\mathcal{S}_n$ preserves the relative ordering among the coordinates. Therefore,
the maximal invariant partition of $R^n - S$ divides it into $n!$ regions, each
of which has a given relative ordering among the coordinates. A simple
maximal invariant function is the set $r(x)$ of ranks of the coordinates:

(3.5)    $r(x) = (r_1, \ldots, r_n)$,

where $(r_1, \ldots, r_n)$ is a permutation of $(1, \ldots, n)$, the same permutation that
$(x_1, \ldots, x_n)$ is of $(x_{(1)}, \ldots, x_{(n)})$. It is easily seen that $r_i$ is the number of
coordinates smaller than or equal to the coordinate $x_i$.
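The invariance of the rank vector under a common strictly increasing continuous transformation of the coordinates can be illustrated with a short sketch. The function names `ranks` and `increasing_map` are hypothetical, and the particular transformation is an arbitrary choice made for the example; coordinates are assumed distinct.

```python
import math


def ranks(xs):
    """r_i = number of coordinates less than or equal to x_i (no ties assumed)."""
    return [sum(1 for y in xs if y <= x) for x in xs]


def increasing_map(v):
    """An arbitrary strictly increasing continuous map of the real line."""
    return math.atan(v) + v**3


x = [0.3, -1.2, 2.5, 0.9]
transformed = [increasing_map(v) for v in x]
same_ranks = ranks(x) == ranks(transformed)
```

Here `ranks(x)` is `[2, 1, 4, 3]`, and applying `increasing_map` to every coordinate leaves it unchanged, so any function of the ranks alone is invariant under such transformations.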

The natural thing to consider next is the maximal invariant partition on 
the parameter space. The real purpose of the maximal invariant partition 
is to define sets over which the distribution of any invariant statistic is 
constant. Therefore, instead of constructing the maximal sets, we define 
larger sets which are invariant but not maximal invariant. These sets 
have the property that the distribution of any invariant statistic remains 
constant within any set. However, these sets do not define a partition 
since they can overlap without being identical. For convenience we
define the sets over the space of distribution functions corresponding to
$\Omega^n$. Consider the sets

(3.6)    $\{(h_1(F_\theta(x)), \ldots, h_n(F_\theta(x))) \mid \theta \in \Omega\}$,

where the $h_i$ are monotone-increasing continuous functions defined on
$[0, 1]$.

First, any 'point' $(F_{\theta_1}(x), \ldots, F_{\theta_n}(x))$ belongs to one of these sets. For
let $F_\theta(x)$ be any strictly increasing distribution function with $\theta \in \Omega$, and
define

$$h_i(u) = F_{\theta_i}(F_\theta^{-1}(u)).$$

Then obviously

$$h_i(F_\theta(x)) = F_{\theta_i}(x),$$

and the point belongs to (3.6) with the above definition of the $h_i$. Second,
the sets are invariant under any transformation $s$. For, under this trans-
formation, the transform of

$$(h_1(F_\theta), \ldots, h_n(F_\theta))$$

is

$$(h_1(F_{s\theta}), \ldots, h_n(F_{s\theta})),$$

and, since $F_{s\theta}$ is a continuous distribution function, these points are in
the same set.
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THEOREM 3.1. Any invariant (rank) test for the problem of randomness 
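has a power function which is constant-valued within any set
$\{(h_1 F_\theta, \ldots, h_n F_\theta) \mid \theta \in \Omega\}$. For the hypothesis $\{(F_\theta, \ldots, F_\theta) \mid \theta \in \Omega\}$ against
the alternative $\{(h_1^* F_\theta, \ldots, h_n^* F_\theta) \mid \theta \in \Omega\}$ with given $h_1^*, \ldots, h_n^*$, there is a
most powerful invariant test.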


Proof. We prove that the distribution of the maximal invariant
statistic is constant within any set of the form (3.6). The two statements
of the theorem then follow immediately. For this proof let $s$ designate a
typical strictly increasing transformation on the real line. Because of the
strict monotonicity, the inverse function is at most single-valued, and its
range may be extended to the entire real line by requiring that $s^{-1}$ be
nondecreasing. This insures that $s^{-1}$ is always a continuous function.
Hence, if $X$ has a continuous distribution function $F_\theta(x)$, then the distribu-
tion function of $sX$, $F_\theta(s^{-1}x)$, is also continuous. Let $s\theta$ be such that
$F_{s\theta}(x) = F_\theta(s^{-1}x)$.

If Fy (x) and F,,(v) are any continuous distribution functions, we now 


nondecreasing transform 
function F,(z), Sy 
Fy (a) = Fs 


ly t ence sı is strictly increasing. 
s2 can be similarly defined, 

The rank function r(x) j 
transformation s. 
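Over each set (3.6). Let $(h_1 F_{\theta_1}, \ldots, h_n F_{\theta_1})$ and $(h_1 F_{\theta_2}, \ldots, h_n F_{\theta_2})$ be two
points in a set (3.6), and let $s_1$, $s_2$, and $F_\theta$ be defined as in the paragraph above.
Then

$$\Pr\{r(X) \in A \mid (h_1 F_{\theta_1}, \ldots, h_n F_{\theta_1})\} = \Pr\{r(s_1 X) \in A \mid (h_1 F_\theta, \ldots, h_n F_\theta)\}$$
$$= \Pr\{r(s_2 X) \in A \mid (h_1 F_\theta, \ldots, h_n F_\theta)\}$$
$$= \Pr\{r(X) \in A \mid (h_1 F_{\theta_2}, \ldots, h_n F_{\theta_2})\}.$$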


This proves the theorem. 


We now consider the problem of independence. Let
$(X_1^{(1)}, X_1^{(2)}), \ldots, (X_n^{(1)}, X_n^{(2)})$ be independent, and let each $(X_i^{(1)}, X_i^{(2)})$ have
the continuous distribution function $F_\eta(x^{(1)}, x^{(2)})$ where $\eta \in \Omega^*$
indexes the continuous distribution functions on $R^2$. The problem of
independence is given by

         Hypothesis: $F_\eta(x^{(1)}, x^{(2)}) = F_\eta^{(1)}(x^{(1)})\, F_\eta^{(2)}(x^{(2)})$
                     for all $x^{(1)}$, $x^{(2)}$; $\eta \in \Omega^*$,
(3.7)    Alternative: $F_\eta(x^{(1)}, x^{(2)}) \ne F_\eta^{(1)}(x^{(1)})\, F_\eta^{(2)}(x^{(2)})$
                     for some $x^{(1)}$, $x^{(2)}$; $\eta \in \Omega^*$.

As a typical transformation of the outcome $(x_1^{(1)}, x_1^{(2)}; \ldots; x_n^{(1)}, x_n^{(2)})$,
consider applying to each $x_i^{(1)}$ a transformation $s$ belonging to the group $\mathcal{S}$
defined earlier and to each $x_i^{(2)}$ a transformation $s'$ also belonging to $\mathcal{S}$. It
is easily seen that such transformations leave the problem unchanged.
The maximal invariant function for this class of transformations is the
combination of the set of ranks for the $x_i^{(1)}$, say

$$r(x^{(1)}) = (r_1^{(1)}, \ldots, r_n^{(1)}),$$

and the set of ranks for the $x_i^{(2)}$, say

$$r(x^{(2)}) = (r_1^{(2)}, \ldots, r_n^{(2)}).$$
Let $h(u^{(1)}, u^{(2)})$ be any continuous distribution function defined over the
unit square, $0 \le u^{(1)}, u^{(2)} \le 1$. We now define sets of distribution
functions over $R^2$. As a typical set consider

(3.8)    $\{h(F_\theta(x^{(1)}), F_{\theta'}(x^{(2)})) \mid \theta, \theta' \in \Omega\}$,

where $\theta \in \Omega$ indexes the continuous distribution functions on $R^1$. Then
we have


THEOREM 3.2. Any rank test based on $(r(x^{(1)}), r(x^{(2)}))$ for the problem of
independence has a power function which is constant-valued within each
set (3.8). There is a most powerful rank test against the alternative
$\{h^*(F_\theta(x^{(1)}), F_{\theta'}(x^{(2)})) \mid \theta, \theta' \in \Omega\}$ for given $h^*$.

Proof. Similar to that for Theorem 3.1.


3.3 Most Powerful Rank Tests. In Section 3.2 we showed how the
invariance method reduced two general types of nonparametric problems
to a consideration of tests based on ranks. Rank tests, however, have
other properties which make them desirable, regardless of whether or not
they have for any particular problem the justification of invariance. For
example, the consideration of rank tests has a certain mathematical
simplicity in that, for the problem given in terms of the ranks, the sample
space is finite. Also, rank tests are frequently easy to apply, and for
many problems tables are available. In this section we consider a general
type of problem for which rank tests can be applied and show how to
obtain rank tests most powerful for a simple alternative.

Consider the sample space $\mathcal{X} = R^N$ with outcome


$x = (x_{11}, \ldots, x_{1n_1}; \ldots; x_{p1}, \ldots, x_{pn_p})$,

where $N = \sum_{1}^{p} n_i$. Let $X$ designate a random variable over $R^N$ with
probability measure $P_\theta^X$, $\theta \in \Omega$. We assume that, for each $\theta$, the
probability measure of the set $S$ of diagonal points is zero:

$S = \{x \mid \text{for some } i,\ x_{ij} = x_{ij'} \text{ for some } j \ne j'\}$.

Thus we can in effect ignore the set $S$ and use as our sample space $R^N - S$.

Assume that, for each $\theta \in \omega$, the probability measure $P_\theta^X$ is invariant
under any permutation of the coordinates $x_{i1}, \ldots, x_{in_i}$, for each $i$. Then
consider the hypothesis testing problem

(3.9) Hypothesis: $\theta \in \omega$,
      Alternative: $\theta \in \Omega - \omega$,

and in particular the problem of obtaining a test which has maximum
power for a simple

(3.10) Alternative: $\theta = \theta^*$.

We are, of course, interested only in alternatives $\theta^*$ for which the
probability measure $P_{\theta^*}^X$ is not invariant under all the permutations mentioned
above.


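Consider the set of ranks

(3.11) $r(x) = (r(x_{11}, \ldots, x_{1n_1}); \ldots; r(x_{p1}, \ldots, x_{pn_p}))$.

This set comprises the ranks for $x_{11}, \ldots, x_{1n_1}$ (a permutation of $1, \ldots, n_1$), $\ldots$,
the ranks for $x_{p1}, \ldots, x_{pn_p}$ (a permutation of $1, \ldots, n_p$).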
Problem 14 is to construct 


ns and a parameter 
space Q such that 


on. Because of the 


l > f obtaining a size-x rank test most powerful 
for the simple alternative (3.10). Under the hypothesis the probability
measure is symmetric under the permutations within each group of
coordinates $(x_{i1}, \ldots, x_{in_i})$. Thus the outcomes $x$ producing any value of
the statistic $r(x)$ can be obtained from the outcomes producing any other
value of $r(x)$ by the permutations of the above form. Hence, each value
of $r(x)$ has the same probability, and this probability must therefore be
$1/(n_1! \cdots n_p!)$. We have that, under the hypothesis, the induced distribution of $r(x)$ does not
depend on $\theta$ and assigns equal probability to each of the $n_1! \cdots n_p!$ possibilities.
Under the simple alternative, there is of course a single induced distribution
for the statistic $r(x)$. Now, with a simple hypothesis and a simple
alternative for the problem expressed in terms of $r(x)$, we can apply the
fundamental lemma of Chapter 2 and find the most powerful rank test.
Since the density function under the hypothesis has a constant value, the
most powerful test has the form

$$\phi(r) = \begin{cases} 1 & \text{if } P_{\theta^*}^R(r) > c, \\ a & \text{if } P_{\theta^*}^R(r) = c, \\ 0 & \text{if } P_{\theta^*}^R(r) < c, \end{cases}$$

where $P_{\theta^*}^R$ is the probability measure for $r(x)$ induced from $P_{\theta^*}^X$ over $R^N$.
To conclude the theory in this section, we derive a theorem which can
assist in the calculation of the probability distribution $P_{\theta^*}^R$ under the
alternative $\theta^*$.
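The reduction to ranks within blocks can be sketched in a few lines. This is only an illustration: the function names `block_ranks` and `null_probability` are hypothetical, and the outcome is assumed to be given as a list of blocks with no ties inside a block (the set $S$ of diagonal points is excluded).

```python
from fractions import Fraction
from math import factorial, prod

def block_ranks(blocks):
    """Rank each observation within its own block (1 = smallest)."""
    ranks = []
    for block in blocks:
        order = sorted(block)
        if len(set(order)) != len(order):
            # ties belong to the set S, which has probability zero
            raise ValueError("tied coordinates within a block")
        ranks.append([order.index(v) + 1 for v in block])
    return ranks

def null_probability(blocks):
    """Probability 1/(n_1! ... n_p!) of any one rank vector under the hypothesis."""
    return Fraction(1, prod(factorial(len(b)) for b in blocks))

x = [[0.3, 0.1, 0.2], [5.0, 4.0]]     # p = 2 blocks, n_1 = 3, n_2 = 2
print(block_ranks(x))                 # ranks within each block
print(null_probability(x))            # one of 3! * 2! equally likely values
```

Because the hypothesis distribution of the ranks is uniform, ordering the rank vectors by their probability under the alternative is all that the fundamental lemma requires.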

Suppose that the alternative probability measure $P_{\theta^*}^X$ is absolutely
continuous with respect to Lebesgue measure and has density function
$f^*(x)$. Also suppose that under the hypothesis there is a probability
measure $P_\theta^X$ which has a density function $f(x)$ greater than 0 whenever
$f^*(x)$ is greater than zero. Because of the symmetry of the hypothesis
measures, we can always choose the density $f(x)$ to be symmetric under the
permutations within blocks as defined above. We define an order
statistic $x^o$ for the outcome $x$ considered in blocks,

$x^o = (x_{1(1)}, \ldots, x_{1(n_1)}; \ldots; x_{p(1)}, \ldots, x_{p(n_p)})$,

where, for example, $(x_{1(1)}, \ldots, x_{1(n_1)})$ is the set $(x_{11}, \ldots, x_{1n_1})$ arranged in
order of magnitude $x_{1(1)} < \cdots < x_{1(n_1)}$. By the same argument that
produced the distribution of the order statistic for the uniform distribution
(see Section 3 of Chapter 4), we obtain the probability density
function for $x^o$ as induced from the hypothesis density $f(x)$; it is

$n_1! \cdots n_p! \, f(x^o)$

over the set $I = \{x \mid x_{11} < \cdots < x_{1n_1}; \ldots; x_{p1} < \cdots < x_{pn_p}\}$ in $R^N$ and
zero elsewhere.


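Let $r'$ designate the permutation that produces $r$ from $(1, \ldots, n_1; \ldots; 1, \ldots, n_p)$. Then
let $x_{r'}$ designate the point obtained by applying $r'$ to the
coordinates of $x$. Also let $S_r$ designate the points $x$ in $R^N$ for which the
rank statistic $r(x)$ takes the value $r$. Then we have

$$\begin{aligned}
\Pr_{\theta^*}\{r(X) = r\} &= \int_{S_r} f^*(x)\,dx \\
&= \int_I f^*(x_{r'})\,dx \\
&= \int_I \frac{f^*(x_{r'})}{f(x_{r'})}\, f(x_{r'})\,dx \\
&= \frac{1}{n_1! \cdots n_p!} \int_I \frac{f^*(x_{r'})}{f(x)}\; n_1! \cdots n_p!\, f(x)\,dx \\
&= \frac{1}{n_1! \cdots n_p!}\, E_\theta\left\{\frac{f^*(X^o_{r'})}{f(X^o)}\right\}.
\end{aligned}$$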


Thus under the alternative the probability that the rank statistic takes the
value $r$ is a constant times the expected value, under the hypothesis, of a
function of the order statistic.

THEOREM 3.3 (HOEFFDING). If $f(x)$ and $f^*(x)$ are probability densities
with respect to Lebesgue measure and if $f(x) > 0$
whenever $f^*(x) > 0$, then the probability measure for the rank statistic
under the alternative is given by

(3.12) $n_1! \cdots n_p! \, \Pr_{\theta^*}\{r(X) = r\} = E_\theta\left\{\dfrac{f^*(X^o_{r'})}{f(X^o)}\right\}$,

where $r'$ is the permutation of $(1, \ldots, n_1; \ldots; 1, \ldots, n_p)$ that produces $r$
and is applied to the coordinates of the order statistic $X^o$ to obtain the $X^o_{r'}$
which occurs in the expectation.


EXAMPLE 3.1. As a particular case of the problem of randomness we
consider the two-sample problem. Let $X_1, \ldots, X_{n_1}, X_{n_1+1}, \ldots, X_{n_1+n_2}$ be
independent, each $X_i$ $(i = 1, \ldots, n_1)$ have a continuous distribution
function $F_{\theta_1}(x)$, and each $X_{n_1+j}$ $(j = 1, \ldots, n_2)$ have a continuous distribution
function $F_{\theta_2}(x)$, where $\theta_1, \theta_2$ $(\in \Omega)$ index the continuous distributions
on $R^1$.

The rank statistic is $r(x) = (r_1, \ldots, r_{n_1+n_2})$. For the problem in terms of the induced
distributions of $r(x)$ we have a sufficient statistic. The distribution of the
$x$'s in the 'first sample' is symmetric. From this it follows that the induced
distribution of the ranks for the 'first sample' is also symmetric; that is,
given any set of ranks for the first sample, the relative ordering within the
sample assigns equal probability to each permutation of the elements of the set.
Similarly for the second sample. Hence a sufficient statistic is

$(\{r_1, \ldots, r_{n_1}\}, \{r_{n_1+1}, \ldots, r_{n_1+n_2}\})$.

But to know the set of ranks in the first sample is to know by elimination
the ranks in the second sample, and conversely. Hence either set by
itself forms a sufficient statistic, and for later convenience we choose the
second set. Our sufficient statistic for the rank problem is

$t(r) = \{r_{n_1+1}, \ldots, r_{n_1+n_2}\}$

or more conveniently the ordered ranks

(3.13) $(s_1, \ldots, s_{n_2})$,

where $s_1, \ldots, s_{n_2}$ are the integers $r_{n_1+1}, \ldots, r_{n_1+n_2}$ arranged in order of
magnitude $s_1 < \cdots < s_{n_2}$. By Theorem 3.2, Chapter 2, we can confine
our attention to tests based on the statistic $t(r)$.
We now consider the invariant sets of distribution functions (3.6), as
defined for Theorem 3.1. Since just two distribution functions $F_{\theta_1}(x)$,
$F_{\theta_2}(x)$ completely specify a probability measure for the problem, we can
simplify the description of a typical set (3.6). Consider the set of pairs
$(F_{\theta_1}(x), F_{\theta_2}(x))$ as given by

$\{(h_1(F_\theta(x)), h_2(F_\theta(x))) \mid \theta \in \Omega\}$;

the two distribution functions in a pair refer, respectively, to the first and
second samples. It is straightforward to show that the sets having $h_1$
strictly increasing can be represented by

(3.14) $\{(F_\theta, h(F_\theta)) \mid \theta \in \Omega\}$.

We now consider the probability measure for $(s_1, \ldots, s_{n_2})$ corresponding
to the distributions given by (3.14). Assuming that $F_\theta$ and $h(F_\theta)$ are
absolutely continuous with density functions $f(x)$ and $f^*(x)$, we have

$$\frac{f^*(x)}{f(x)} = \frac{\dfrac{d}{dx}\, h(F_\theta(x))}{\dfrac{d}{dx}\, F_\theta(x)} = h'(F_\theta(x)),$$

where

$$h'(u) = \frac{d}{du}\, h(u).$$

Then designating by $(S_1, \ldots, S_{n_2})$ the random variables with the induced
distribution of $t(r) = (s_1, \ldots, s_{n_2})$ and applying Hoeffding's Theorem 3.3,
we obtain

$$\Pr_{F_\theta, h(F_\theta)}\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\} = \frac{n_1!\, n_2!}{(n_1 + n_2)!}\, E_{F_\theta}\{h'(F_\theta(X_{(s_1)})) \cdots h'(F_\theta(X_{(s_{n_2})}))\}.$$

The factor $n_1!\, n_2!$ takes account of the $n_1!\, n_2!$ different values of the rank
statistic $(r_1, \ldots, r_{n_1+n_2})$ which produce a given set of ranks for the second
sample $(s_1, \ldots, s_{n_2})$, and $X_{(s_1)}$, for example, is the $s_1$st order statistic in
the set $X_1, \ldots, X_{n_1+n_2}$. Now, when $X$ has the continuous distribution
function $F(x)$, then $F(X)$ has the uniform distribution; therefore

(3.15) $$\Pr_{F_\theta, h(F_\theta)}\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\} = \frac{1}{\dbinom{n_1 + n_2}{n_2}}\, E\{h'(U_{(s_1)}) \cdots h'(U_{(s_{n_2})})\},$$

where $(U_{(1)}, \ldots, U_{(n_1+n_2)})$ is the order statistic for a sample of $n_1 + n_2$
from the uniform distribution.
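Formula (3.15) lends itself to a direct simulation check. The sketch below is only illustrative: the function name `rank_prob_mc`, the number of draws, and the fixed seed are assumptions made here, not part of the text. It averages the product of $h'$ over uniform order statistics and divides by the binomial coefficient.

```python
import random
from math import comb, prod

def rank_prob_mc(s, n1, h_prime, draws=100_000, seed=0):
    """Monte Carlo estimate of (3.15) for ordered second-sample ranks s.

    s       : increasing tuple of ranks (1-based) of the second sample
    n1      : size of the first sample
    h_prime : derivative h'(u) of the function h in (3.14)
    """
    rng = random.Random(seed)
    n2 = len(s)
    total = 0.0
    for _ in range(draws):
        # order statistic of n1 + n2 uniform variables
        u = sorted(rng.random() for _ in range(n1 + n2))
        total += prod(h_prime(u[r - 1]) for r in s)
    return total / draws / comb(n1 + n2, n2)

# h(u) = u gives the hypothesis: every rank set has probability 1 / C(n1+n2, n2).
print(rank_prob_mc((1, 2), 1, lambda u: 1.0))
# h(u) = u^2, so h'(u) = 2u: the second sample is stochastically larger.
print(rank_prob_mc((1, 2), 1, lambda u: 2.0 * u))
```

A simulation of this kind is useful mainly as a check on closed forms such as those derived next for $h(u) = u^k$.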


We now consider a simple function $h(u)$ and evaluate the probability
measure of the ranks $s_1, \ldots, s_{n_2}$. Letting $h(u) = u^k$, we have the

(3.16) Alternative: $(F_1, F_2) \in \{(F_\theta, F_\theta^k) \mid \theta \in \Omega, \text{ fixed } k\}$.

For this alternative a random variable in the second sample has the
distribution function $F_\theta^k(x)$. If $k$ is a positive integer, such a random
variable is equivalent to the largest of $k$ random variables having the
distribution function $F_\theta$ associated with the first sample:

$$\Pr_{F_\theta^k}\{X \le x\} = F_\theta^k(x) = [\Pr_{F_\theta}\{X \le x\}]^k = \Pr_{F_\theta}\Big\{\max_{i=1}^{k} X_i \le x\Big\}.$$

The class of distributions in the alternative (3.16) corresponds to a
simple example of the class (3.14) and has the easily interpreted form
of a second-sample value being equivalent to the largest of $k$ values from the
first sample distribution.


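For the alternative (3.16) with $h(u) = u^k$ we have $h'(u) = ku^{k-1}$. Then,
substituting in (3.15), we obtain

(3.17) $$\Pr_{F, F^k}\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\} = \frac{k^{n_2}}{\dbinom{n_1 + n_2}{n_2}}\, E\{U_{(s_1)}^{k-1} \cdots U_{(s_{n_2})}^{k-1}\}.$$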
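To evaluate this we need the joint distribution of $(U_{(s_1)}, \ldots, U_{(s_{n_2})})$.
For a single order statistic $U_{(s)}$, this was obtained in formula (3.5) in
Chapter 4. For two order statistics, see Problem 5, Chapter 4. In a
similar manner it can be shown that the joint density function for
$(U_{(s_1)}, \ldots, U_{(s_{n_2})})$ is equal to

(3.18) $$\Gamma(n_1 + n_2 + 1) \prod_{j=0}^{n_2} \frac{(u_{j+1} - u_j)^{s_{j+1} - s_j - 1}}{\Gamma(s_{j+1} - s_j)}$$

over the region $0 = u_0 < u_1 < \cdots < u_{n_2+1} = 1$, where for convenience we
define $s_0 = 0$, $s_{n_2+1} = n_1 + n_2 + 1$ and $u_0 = 0$, $u_{n_2+1} = 1$, and where we
associate $u_j$ with $U_{(s_j)}$. Now, making the change of variable

$u_j = v_j v_{j+1} \cdots v_{n_2}$ $\quad (j = 1, \ldots, n_2)$,

and noting that the Jacobian has the value $v_2 v_3^2 \cdots v_{n_2}^{n_2 - 1}$, we obtain the
joint density of the $v$'s,

(3.19) $$\begin{aligned}
&\Gamma(n_1 + n_2 + 1) \prod_{j=0}^{n_2} \frac{[(1 - v_j)\, v_{j+1} \cdots v_{n_2+1}]^{s_{j+1} - s_j - 1}}{\Gamma(s_{j+1} - s_j)} \; \prod_{j=1}^{n_2} v_j^{\,j-1} \\
&\qquad = \prod_{j=1}^{n_2} \frac{\Gamma(s_{j+1})}{\Gamma(s_j)\, \Gamma(s_{j+1} - s_j)}\; v_j^{\,s_j - 1} (1 - v_j)^{s_{j+1} - s_j - 1},
\end{aligned}$$

over the region $0 < v_j < 1$ $(j = 1, \ldots, n_2)$, where again for convenience
we let $v_0 = 0$ and $v_{n_2+1} = 1$. From the factoring of the density function
over the product set it is seen that $V_1, \ldots, V_{n_2}$ are independent and $V_j$ has
the $\beta$ distribution with parameters $s_j$, $s_{j+1} - s_j$.

Now, since $u_1 \cdots u_{n_2} = v_1 v_2^2 \cdots v_{n_2}^{n_2}$, our formula (3.17) becomes

(3.20) $$\begin{aligned}
\Pr_{F, F^k}\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\} &= \frac{k^{n_2}}{\dbinom{n_1 + n_2}{n_2}} \prod_{j=1}^{n_2} E\{V_j^{\,j(k-1)}\} \\
&= \frac{k^{n_2}}{\dbinom{n_1 + n_2}{n_2}} \prod_{j=1}^{n_2} \frac{\Gamma(s_j + jk - j)\, \Gamma(s_{j+1})}{\Gamma(s_j)\, \Gamma(s_{j+1} + jk - j)}.
\end{aligned}$$

This last step is obtained by noting that the $r$th moment of the $\beta$ distribution
with parameters $p$, $q$ is

$$\frac{\Gamma(p + r)\, \Gamma(p + q)}{\Gamma(p)\, \Gamma(p + q + r)}.$$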
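For integer $k$ the gamma functions in (3.20) are factorials, so the rank probabilities can be evaluated exactly. The following sketch does this with rational arithmetic; the function name `lehmann_rank_prob` is hypothetical, and the restriction to integer $k \ge 1$ is an assumption made only so that factorials suffice.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial

def lehmann_rank_prob(s, n1, k):
    """Evaluate (3.20) exactly for integer k >= 1.

    s  : increasing tuple of second-sample ranks (s_1, ..., s_{n2})
    n1 : size of the first sample
    """
    n2 = len(s)
    s_ext = list(s) + [n1 + n2 + 1]            # convention s_{n2+1} = n1 + n2 + 1
    prob = Fraction(k ** n2, comb(n1 + n2, n2))
    for j in range(1, n2 + 1):
        r = j * (k - 1)                        # order of the moment of V_j
        sj, sj1 = s_ext[j - 1], s_ext[j]
        # Gamma(m) = (m - 1)! for a positive integer m
        prob *= Fraction(factorial(sj + r - 1) * factorial(sj1 - 1),
                         factorial(sj - 1) * factorial(sj1 + r - 1))
    return prob

n1, n2, k = 3, 2, 2
table = {s: lehmann_rank_prob(s, n1, k)
         for s in combinations(range(1, n1 + n2 + 1), n2)}
print(table)                  # larger ranks for the second sample are more probable
print(sum(table.values()))    # the probabilities add to one
```

With $k = 1$ the alternative coincides with the hypothesis and every rank set receives the same probability, which gives a quick sanity check on the formula.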
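In the particular case when $k = 2$, formula (3.20) becomes

(3.21) $$\begin{aligned}
\Pr_{F, F^2}\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\} &= \frac{2^{n_2}}{\dbinom{n_1 + n_2}{n_2}} \prod_{j=1}^{n_2} \frac{(s_j + j - 1)!\,(s_{j+1} - 1)!}{(s_{j+1} + j - 1)!\,(s_j - 1)!} \\
&= \frac{2^{n_2}}{\dbinom{n_1 + n_2}{n_2}} \cdot \frac{s_1 (s_2 + 1) \cdots (s_{n_2} + n_2 - 1)}{(n_1 + n_2 + 1)(n_1 + n_2 + 2) \cdots (n_1 + 2n_2)}.
\end{aligned}$$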

These formulas, (3.20), (3.21), can be used to calculate the power of any
rank test against the alternative given by (3.16). We use them now,
however, in conjunction with Theorem 3.1 to obtain most powerful rank
tests. Against the alternative (3.16) with $k = 2$, we order the values of
the statistic $(s_1, \ldots, s_{n_2})$ according to their probability under the alternative
and obtain the most powerful test

$$\phi(s_1, \ldots, s_{n_2}) = \begin{cases} 1 & \text{if } s_1 (s_2 + 1) \cdots (s_{n_2} + n_2 - 1) > c, \\ a & \text{if } s_1 (s_2 + 1) \cdots (s_{n_2} + n_2 - 1) = c, \\ 0 & \text{if } s_1 (s_2 + 1) \cdots (s_{n_2} + n_2 - 1) < c, \end{cases}$$

where $a$, $c$ are chosen to give the test proper size.
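Choosing $a$ and $c$ is a finite computation, since under the hypothesis each of the $\binom{n_1+n_2}{n_2}$ rank sets is equally likely. A minimal sketch follows; the names `k2_statistic` and `randomized_cutoff` are hypothetical, and the size is passed as an exact fraction only to keep the arithmetic exact.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

def k2_statistic(s):
    """s_1 (s_2 + 1) ... (s_{n2} + n2 - 1) for increasing ranks s."""
    return prod(sj + j for j, sj in enumerate(s))   # j runs over 0, 1, ..., n2 - 1

def randomized_cutoff(n1, n2, alpha, statistic=k2_statistic):
    """Return (c, a): reject if T > c, reject with probability a if T = c."""
    subsets = list(combinations(range(1, n1 + n2 + 1), n2))
    values = [statistic(s) for s in subsets]
    p0 = Fraction(1, len(subsets))                  # hypothesis probability of each rank set
    for c in sorted(set(values), reverse=True):
        above = sum(v > c for v in values) * p0
        at = sum(v == c for v in values) * p0
        if above + at > alpha:
            return c, (Fraction(alpha) - above) / at
    return min(values), Fraction(1)                 # alpha >= 1: always reject

c, a = randomized_cutoff(3, 2, Fraction(3, 20))
print(c, a)
```

The same routine serves for any rank statistic whose large values are significant; only the `statistic` argument changes.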


We consider another alternative in which the function $h(u)$ takes the
form

(3.22) $h_p(u) = qu + pu^2$,

where $0 < p < 1$ and $p + q = 1$. We give an interpretation to this
second sample distribution function, $qF + pF^2$. A random variable from
this distribution is equivalent to a random variable chosen with probability
$q$ from the distribution $F$ and $p$ from the distribution $F^2$. We now show
that the familiar one-sided Mann-Whitney test is the rank test maximizing
the power when $p$ is very close to zero, a locally most powerful test for the
alternative (3.22).
Let $P_\phi(p)$ designate the power function of a test $\phi$ against alternative
(3.22). We wish to find the size-$\alpha$ test that maximizes $P_\phi'(0)$, the slope of
the power function at $p = 0$. From (3.15) we obtain

(3.23) $$\frac{d}{dp} \Pr_{F, h_p(F)}\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\}\Big|_{p=0} = \frac{1}{\dbinom{n_1 + n_2}{n_2}}\, \frac{d}{dp}\, E\{h_p'(U_{(s_1)}) \cdots h_p'(U_{(s_{n_2})})\}\Big|_{p=0}.$$

Since $h_p'(u) = q + 2pu = 1 + p(2u - 1)$, (3.23) becomes

$$\frac{1}{\dbinom{n_1 + n_2}{n_2}}\, E\Big\{\sum_{j=1}^{n_2} (2U_{(s_j)} - 1)\Big\} = \frac{1}{\dbinom{n_1 + n_2}{n_2}} \Big[\frac{2}{n_1 + n_2 + 1} \sum_{j=1}^{n_2} s_j - n_2\Big].$$

We wish to maximize $P_\phi'(0)$; that is, to maximize the 'power' when calculated
using the 'measure' given by (3.23). Noting that (3.23) is a strictly
increasing function of $\sum_{1}^{n_2} s_j$ and applying the fundamental lemma of
Chapter 2, we obtain the test

$$\phi(s_1, \ldots, s_{n_2}) = \begin{cases} 1 & \text{if } \sum_{1}^{n_2} s_j > c, \\ a & \text{if } \sum_{1}^{n_2} s_j = c, \\ 0 & \text{if } \sum_{1}^{n_2} s_j < c, \end{cases}$$

where $a$, $c$ are chosen to give the test size $\alpha$. This is the one-sided Mann-Whitney
test. However, when we introduced the test in Example 1.1 in
Section 1 we based it on a statistic $V$. Problem 15 in Section 7 is to show
that $\sum_{1}^{n_2} s_j$ is a strictly increasing function of $V$ (actually a linear relationship).
Thus the two definitions of the test are equivalent.
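The linear relationship can be illustrated numerically. The definition of $V$ from Example 1.1 is not reproduced in this section, so the sketch below assumes the usual Mann-Whitney count, the number of pairs in which a first-sample value is smaller than a second-sample value; the function names are hypothetical and ties are assumed absent.

```python
def second_sample_ranks(first, second):
    """Ordered ranks (s_1, ..., s_{n2}) of the second sample in the combined sample."""
    combined = sorted(first + second)
    return sorted(combined.index(y) + 1 for y in second)

def mann_whitney_count(first, second):
    """Number of pairs (x, y) with x from the first sample, y from the second, x < y."""
    return sum(x < y for x in first for y in second)

first = [1.2, 3.4, 0.5]
second = [2.0, 4.1]
n2 = len(second)
rank_sum = sum(second_sample_ranks(first, second))
count = mann_whitney_count(first, second)
# Each second-sample rank counts the first-sample values below it plus the
# second-sample values at or below it, so rank_sum = count + n2 (n2 + 1) / 2.
print(rank_sum, count + n2 * (n2 + 1) // 2)
```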
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EXAMPLE 3.2. We consider again the two-sample problem, but in this 
example we find a rank test most powerful against an alternative involving 
normal distributions. Using the normal distribution having density 


function,

(3.24) $$(2\pi\sigma^2)^{-(n_1 + n_2)/2} \exp\Big[-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^{n_1} (x_i - \mu_1)^2 + \sum_{j=1}^{n_2} (x_{n_1+j} - \mu_2)^2\Big)\Big],$$

we consider the

(3.25) Alternative: $\mu_2 > \mu_1$; $\mu_1, \mu_2 \in R^1$, $\sigma^2 \in\, ]0, \infty[$.

Unfortunately there does not exist a rank test uniformly most powerful for
this alternative. However, the most powerful test for the alternative
specifying $\mu_1$, $\mu_2$, $\sigma^2$ is found to depend only on the ratio $(\mu_2 - \mu_1)/\sigma$.
Also, when the ratio is small, the test takes a particularly simple form and
is called the $c_1$ test. The $c_1$ test is the locally most powerful test against a
one-sided normal alternative.


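We now use Theorem 3.3 and the reductions made in the previous
example to find the probability distribution of the rank statistic. As
hypothesis and alternative densities $f(x)$, $f^*(x)$, consider (3.24) with
$\mu_2 = \mu_1$ and $\mu_2 = \mu_1 + \delta$, respectively. Then we have

$$\begin{aligned}
\frac{f^*(x)}{f(x)} &= \exp\Big[-\frac{1}{2\sigma^2} \sum_{j=1}^{n_2} (x_{n_1+j} - \mu_1 - \delta)^2 + \frac{1}{2\sigma^2} \sum_{j=1}^{n_2} (x_{n_1+j} - \mu_1)^2\Big] \\
&= \exp\Big[\frac{\delta}{\sigma^2} \sum_{j=1}^{n_2} x_{n_1+j} - \frac{n_2}{2\sigma^2} (2\mu_1 + \delta)\,\delta\Big] \\
&= \exp\Big[-\frac{n_2}{2\sigma^2} (2\mu_1 + \delta)\,\delta\Big] \sum_{\nu=0}^{\infty} \frac{1}{\nu!} \Big(\frac{\delta}{\sigma^2}\Big)^{\nu} \Big(\sum_{j=1}^{n_2} x_{n_1+j}\Big)^{\nu}.
\end{aligned}$$

We now apply Theorem 3.3 somewhat in the form of (3.15) and find that,
under the alternative $\mu_2 - \mu_1 = \delta$,

(3.26) $$\begin{aligned}
\Pr_\delta\{S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\} &= \frac{1}{\dbinom{n_1 + n_2}{n_2}}\, E\Big\{\exp\Big[\frac{\delta}{\sigma^2} \sum_{j=1}^{n_2} X_{(s_j)} - \frac{n_2}{2\sigma^2} (2\mu_1 + \delta)\,\delta\Big]\Big\} \\
&= k\Big[1 + \frac{1}{1!} \frac{\delta}{\sigma^2}\, c_1(s) + \frac{1}{2!} \Big(\frac{\delta}{\sigma^2}\Big)^2 c_2(s) + \cdots\Big],
\end{aligned}$$

where $k$ is a positive constant with respect to the $s$'s,

$$c_\nu(s) = E\Big\{\Big[\sum_{j=1}^{n_2} X_{(s_j)}\Big]^{\nu}\Big\},$$

and $(X_{(1)}, \ldots, X_{(n_1+n_2)})$ is the order-statistic random variable with its
hypothesis distribution. The second step above is valid since the expression
has continuous derivatives with respect to $\delta$ of any order, and the conditions for
differentiating under the sign of integration are fulfilled.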

To find a most powerful test we order the possible values of $(s_1, \ldots, s_{n_2})$
by the alternative probability (3.26). However, this probability has an
unwieldy form unless $\delta$ is small. By choosing $\delta$ sufficiently small we can
order by the first term in (3.26) rather than by the whole expression (only
a finite number of values for $(s_1, \ldots, s_{n_2})$). Thus for $\delta$ sufficiently small the
most powerful rank test is given by

$$\phi(s_1, \ldots, s_{n_2}) = \begin{cases} 1 & \text{if } c_1(s) > c, \\ a & \text{if } c_1(s) = c, \\ 0 & \text{if } c_1(s) < c, \end{cases}$$

where $a$, $c$ are chosen to give the test size $\alpha$. The statistic $c_1(s)$ can be
evaluated:

$$c_1(s) = E\Big\{\sum_{j=1}^{n_2} X_{(s_j)}\Big\} = \sum_{j=1}^{n_2} E\{X_{(s_j)}\} = k_1 \sum_{j=1}^{n_2} E\{Z_{(s_j)}\} + k_2,$$

where $(Z_{(1)}, \ldots, Z_{(n_1+n_2)})$ is the order-statistic random variable for the
normal distribution with mean 0 and variance 1, and $k_1$, $k_2$ are constants
with respect to the $s$'s. The statistic $c_1(s)$ is a monotone-increasing function of
$\sum_{j=1}^{n_2} E\{Z_{(s_j)}\}$; therefore the test can be written

$$\phi(s_1, \ldots, s_{n_2}) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n_2} E\{Z_{(s_j)}\} > c, \\ a & \text{if } \sum_{j=1}^{n_2} E\{Z_{(s_j)}\} = c, \\ 0 & \text{if } \sum_{j=1}^{n_2} E\{Z_{(s_j)}\} < c. \end{cases}$$

It is a uniformly most powerful test against one-sided normal alternatives
having $(\mu_2 - \mu_1)/\sigma$ small and positive. The method of applying the test is
to replace each component of the order statistic $(x_{(1)}, \ldots, x_{(n_1+n_2)})$ that
falls in the second sample by the expected value of the corresponding
component of the order statistic for the unit normal and to reject if the sum
of such values for the second sample is large. This test, derived by Terry
[10], was proposed originally by Fisher and Yates in [11].
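The expected values $E\{Z_{(r)}\}$ are usually read from tables, but they can also be computed by numerical integration. The sketch below uses the standard density of the $r$th order statistic and the trapezoidal rule; the function names, the integration range, and the step count are choices made here for illustration, not part of the text.

```python
from math import comb, erf, exp, pi, sqrt

def expected_normal_order_stat(r, n, half_width=8.0, steps=4000):
    """E{Z_(r)} for a sample of n from the unit normal, by the trapezoidal rule."""
    coef = r * comb(n, r)                      # n! / ((r - 1)! (n - r)!)
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
        pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * z * pdf * cdf ** (r - 1) * (1.0 - cdf) ** (n - r)
    return coef * total * h

def c1_score(s, n1):
    """Sum of expected unit-normal order statistics at the second-sample ranks s."""
    n = n1 + len(s)
    return sum(expected_normal_order_stat(sj, n) for sj in s)

print(expected_normal_order_stat(2, 2))   # close to 1 / sqrt(pi)
print(c1_score((4, 5), 3), c1_score((1, 2), 3))
```

The test rejects when `c1_score` is large, so rank sets concentrated at the upper end of the combined sample are the significant ones.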
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4. THE LIKELIHOOD-RATIO METHOD 


The likelihood-ratio method was one of the earliest contributions to
the theory of hypothesis testing; it was introduced by Neyman and
E. S. Pearson in 1928. We outline briefly the method. Assume that the
probability measures can be represented by a class of density functions
$\{f_\theta(x) \mid \theta \in \Omega\}$ with respect to a fixed measure $\mu(A)$ over $\mathcal{X}(\mathcal{A})$. For the
problem,

Hypothesis: $\theta \in \omega$,
Alternative: $\theta \in \Omega - \omega$,

the likelihood-ratio test is given by

$$\phi(x) = \begin{cases} 1 & \text{if } L(x) > c, \\ a & \text{if } L(x) = c, \\ 0 & \text{if } L(x) < c, \end{cases}$$

where $L(x)$, called the likelihood ratio, is defined by

(4.1) $$L(x) = \frac{\sup_{\theta \in \Omega} f_\theta(x)}{\sup_{\theta \in \omega} f_\theta(x)}.$$

In their original paper Neyman and Pearson used the reciprocal of $L(x)$
and considered only nonrandomized tests. In the case of a simple
hypothesis and simple alternative the likelihood-ratio method usually
produces the most powerful test. For consider the likelihood statistic:

$$L(x) = \frac{\max[f_0(x), f_1(x)]}{f_0(x)} = \max\Big[1, \frac{f_1(x)}{f_0(x)}\Big].$$

For large values of the likelihood ratio $L(x)$, there is a strictly increasing
relationship (actually equality) with the probability-ratio statistic,
$f_1(x)/f_0(x)$, upon which the most powerful test is based. For other than
this simple case, the success of the likelihood ratio in producing a good test
is usually due to the fact that the likelihood test is always a function of a
sufficient statistic for the problem; this fact follows from the factorization
of the density as proved by the Halmos and Savage Theorem 5.2 in
Chapter 1. The main justification for the method still remains its past
success in producing workable tests often with good properties. However,
an example has been given by Stein [13] where the likelihood-ratio test
does worse than the simple randomized test which rejects with probability
$\alpha$, regardless of the outcome. In the special case where the outcome is a
sample of $n$ from a distribution over a component space, the likelihood-ratio
test has been shown to have asymptotically good properties as $n \to \infty$.
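For a simple hypothesis and a simple alternative the statistic (4.1) is easy to compute. The sketch below takes two unit-variance normal densities with means 0 and 1 purely as an illustrative assumption; the function name `likelihood_ratio` is hypothetical.

```python
from math import exp
from statistics import NormalDist

def likelihood_ratio(x, f0, f1):
    """L(x) of (4.1) when omega = {0} and Omega = {0, 1}."""
    return max(f0(x), f1(x)) / f0(x)

f0 = NormalDist(0.0, 1.0).pdf    # hypothesis density (illustrative choice)
f1 = NormalDist(1.0, 1.0).pdf    # alternative density (illustrative choice)

# Where f1 > f0 the statistic equals the probability ratio f1/f0 = exp(x - 1/2);
# elsewhere it is constant at 1.
print(likelihood_ratio(2.0, f0, f1), exp(1.5))
print(likelihood_ratio(-1.0, f0, f1))
```

The flat portion at 1 matters only for tests of very large size, which is why the method agrees with the most powerful test for large values of $L(x)$.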

For the nonparametric problems of Chapter 3, a direct application of
the likelihood-ratio method fails; both the numerator and the denominator
equal infinity, and the statistic is indeterminate. If the sample space is
$R^n$, one modification of the method would be to consider for each point $x$
the ratio

(4.2) $$L^*(x) = \frac{\sup_{\theta \in \Omega} P_\theta(C(x))}{\sup_{\theta \in \omega} P_\theta(C(x))},$$

where $C(x)$ is a rectangular region with faces parallel to the coordinate
planes and centered on the point $x$. Unfortunately, this statistic, although
determined, is usually equal to a constant. Problem 21 is to illustrate this.
Another modification was proposed by Wolfowitz [7]. For the problems
of randomness and independence he suggested that the invariance
principle be used to reduce the problem to a consideration of rank tests and
that then the likelihood-ratio method be used for the problem in terms of
ranks. We illustrate his method for the two-sample problem.

Consider the two-sample problem, using the notation of Example 3.1
in the previous section. For the problem in terms of ranks we have the
sufficient statistic, $t(r) = (s_1, \ldots, s_{n_2})$, which is the ordered set of ranks
for the $x$'s falling in the second sample. To apply the Wolfowitz likelihood-ratio
method, another statistic equivalent to $t(r)$ is more convenient.
To define it, consider the order statistic for the combined samples:
$(x_{(1)}, \ldots, x_{(n_1+n_2)})$. Replace each element of $(x_{(1)}, \ldots, x_{(n_1+n_2)})$ by a 1 or
a 2, according as the element is equal to an $x$ in the first sample or an $x$ in
the second sample. Let the statistic be

$t^*(r) = (v_1, \ldots, v_{n_1+n_2})$,

where each $v_i$ is either a 1 or a 2. Then clearly to know the ranks of the
set of $x$'s producing the second sample is equivalent to knowing which of
the ordered sequence of $x$'s came from the first sample and which from the
second sample. We use the statistic $t^*(r(x))$.

The likelihood ratio for the problem in terms of ranks is

(4.3) $$L(t^*) = \frac{\sup_{\theta \in \Omega} \Pr_\theta\{t^*(r(X)) = t^*\}}{\sup_{\theta \in \omega} \Pr_\theta\{t^*(r(X)) = t^*\}} = \binom{n_1 + n_2}{n_1} \sup_{\theta \in \Omega} \Pr_\theta\{t^*(r(X)) = t^*\}.$$

To evaluate the supremum in this expression seems in general to be far too
difficult. Wolfowitz therefore suggests an approximation. For each
value of $t^*$, Wolfowitz replaces the full class of probability measures by a
parametric class which depends on $t^*$ and seems natural to the purpose of
maximizing

$\Pr\{t^*(r(X)) = t^*\}$.


For the sequence $t^* = (v_1, \ldots, v_{n_1+n_2})$ let $l_{1j}$ be the length of the $j$th run
of 1's and let $l_{2j}$ be the length of the $j$th run of 2's. Then if, for example,
$t^* = (1, 1, 2, 1, 2, 2, 1)$, as is illustrated in Fig. 18, then $l_{11} = 2$, $l_{12} = 1$,
$l_{13} = 1$ corresponding to the 1 runs, 11, 1, 1, and $l_{21} = 1$, $l_{22} = 2$ corresponding
to the 2 runs, 2, 22.

Figure 18. The relative position of the first and second sample $x$'s
when $n_1 = 4$, $n_2 = 3$, and $t^* = (1, 1, 2, 1, 2, 2, 1)$.

We now define a probability distribution for
the first sample with parameters $p_{11}, p_{12}, \ldots$, and a probability distribution
for the second sample with parameters $p_{21}, p_{22}, \ldots$. Take a set of $x$'s
which give rise to the chosen value of $t^*(r(x))$. Let the first probability
distribution have probability $p_{1j}$ over an interval containing the $j$th run of
first-sample $x$'s (for the $j = 1, 2, \ldots$, which correspond to positive values
of $l_{11}, l_{12}, \ldots$), and similarly for the second distribution using second-sample
$x$'s. The first- and second-sample intervals are chosen so that they
do not overlap. (See Fig. 19.)

Figure 19. Typical density functions for the first sample (parameters
$p_{11}, p_{12}, p_{13}$) and for the second sample (parameters $p_{21}, p_{22}$).

Now, to obtain $t^*(r) = t^*$ for samples of $n_1$ and $n_2$ from these probability
distributions, it is easily seen that the first- and second-sample
$x$'s must occur exactly in the pattern of the $x$'s above for which the two
parametric distributions were defined. Then we have


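(4.4) $$\Pr_{p_{ij}}\{t^*(r(X)) = t^*\} = \frac{n_1!}{\prod_j l_{1j}!} \prod_j p_{1j}^{\,l_{1j}} \cdot \frac{n_2!}{\prod_j l_{2j}!} \prod_j p_{2j}^{\,l_{2j}}.$$

It is straightforward for fixed $l_{ij}$ to maximize this expression as a function
of $p_{ij}$ subject to the restrictions $\sum_j p_{1j} = 1$, $\sum_j p_{2j} = 1$, and $p_{ij} \ge 0$ for all
$i, j$. In fact, we have two multinomial distributions and the 'estimates'

$$\hat{p}_{1j} = \frac{l_{1j}}{n_1}, \qquad \hat{p}_{2j} = \frac{l_{2j}}{n_2}$$

are such as to maximize (4.4). The maximized value is then, of course,

$$\sup_{p_{ij}} \Pr_{p_{ij}}\{t^*(r(X)) = t^*\} = \frac{n_1!}{\prod_j l_{1j}!} \cdot \frac{n_2!}{\prod_j l_{2j}!} \prod_j \Big(\frac{l_{1j}}{n_1}\Big)^{l_{1j}} \prod_j \Big(\frac{l_{2j}}{n_2}\Big)^{l_{2j}}.$$

This is the statistic upon which Wolfowitz based his 'modified' likelihood-ratio
test. An equivalent statistic is

$$\prod_{i,j} \frac{l_{ij}^{\,l_{ij}}}{l_{ij}!}$$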

Or, taking logarithms, the statistic 


T= > hy 
ij 


Where ly is defined by 
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The Wolfowitz two sample test is then given by 


dx)=1 if Dee 
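The run lengths and the maximized value of (4.4) are simple to compute for a given sequence $t^*$. The following is a small sketch with hypothetical function names; it uses exact rational arithmetic and assumes $t^*$ is a sequence of the labels 1 and 2.

```python
from fractions import Fraction
from itertools import groupby
from math import factorial, prod

def run_lengths(t_star):
    """Return (l1, l2): the lengths of the runs of 1's and of 2's, in order."""
    runs = {1: [], 2: []}
    for label, group in groupby(t_star):
        runs[label].append(len(list(group)))
    return runs[1], runs[2]

def multinomial_max(lengths):
    """Multinomial probability of the counts `lengths` at p_j = l_j / n."""
    n = sum(lengths)
    coef = Fraction(factorial(n), prod(factorial(l) for l in lengths))
    return coef * prod(Fraction(l, n) ** l for l in lengths)

def wolfowitz_max_probability(t_star):
    """Maximized value of (4.4) for the observed sequence t*."""
    l1, l2 = run_lengths(t_star)
    return multinomial_max(l1) * multinomial_max(l2)

t_star = (1, 1, 2, 1, 2, 2, 1)             # the example of Fig. 18
print(run_lengths(t_star))                 # run lengths of the 1's and of the 2's
print(wolfowitz_max_probability(t_star))
```

Sequences with a few long runs give large values, which is the direction in which the two samples appear well separated.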


5. DISCONTINUITIES 


The theory developed in the earlier sections of this chapter has been 
concerned primarily with continuous and absolutely continuous distribu- 
tions. However, in any application the numerical measurements are not 
free to take any value on the real line but must take values in a finite or 
countable set determined by the number of decimal places to which the 
numbers are recorded. Thus our models are only approximations to the 
correct models which are based on discrete distributions. Although in 
most cases we hope this approximation is quite good, we can no longer 
overlook the possibility that some of the measurements are equal. Most 
of the tests discussed earlier remain valid: that is, of the correct size. 
However, for the rank tests the problem arises of how to assign the ranks 
when there are ‘ties’. One procedure is to use the average of the ranks that 
would have been assigned had there been no ties. An alternative†
procedure is to randomly assign with equal probability the ranks that 
correspond to a set of tied measurements. If the hypothesis specifies 
symmetry, so that, in the continuous case, each permutation of the set of 
ranks has the same probability, then obviously, in the case of discrete 
probabilities, with an equal probability assignment of ranks in the case of 
ties, the different permutations of the set of ranks have equal probability 
under the hypothesis. Theorem 5.1 shows for the two-sample problem 
that properties of size and unbiasedness for rank tests derived under 
continuity assumptions remain valid when the continuity assumptions are 
dropped and tied ranks are assigned randomly. 
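Both procedures for ties are short to implement. In the sketch below the function names `midranks` and `random_tie_ranks` are hypothetical; the random procedure breaks each tie with an independent uniform key, which gives every ordering of a tied group the same probability.

```python
import random

def midranks(xs):
    """Average-rank procedure: tied values share the mean of their ranks."""
    order = sorted(xs)
    out = []
    for v in xs:
        first = order.index(v) + 1          # smallest rank in the tied group
        count = order.count(v)              # size of the tied group
        out.append(first + (count - 1) / 2)
    return out

def random_tie_ranks(xs, rng=None):
    """Random procedure: ranks within a tied group are assigned at random."""
    rng = rng or random.Random()
    # a random secondary key orders tied values uniformly at random
    keyed = sorted(range(len(xs)), key=lambda i: (xs[i], rng.random()))
    ranks = [0] * len(xs)
    for rank, i in enumerate(keyed, start=1):
        ranks[i] = rank
    return ranks

x = [3, 1, 3, 2]
print(midranks(x))
print(random_tie_ranks(x, random.Random(1)))
```

Only the random procedure keeps the rank vector a permutation of $1, \ldots, n$, which is what the equal-probability argument above relies on.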


For the theorem we assume that $X_1, \ldots, X_{n_1+n_2}$ are
independent, that each $X_i$ $(i = 1, \ldots, n_1)$ has the distribution function
$F_{\theta_1}(x)$ and that each $X_{n_1+j}$ has the distribution function $F_{\theta_2}(x)$ $(j = 1, \ldots, n_2)$.

THEOREM 5.1 (LEHMANN). If $\mathcal{E}$ is an event whose occurrence depends
only on the ranks $r_1, \ldots, r_{n_1+n_2}$ of the coordinates of the outcome
$(x_1, \ldots, x_{n_1+n_2})$, and if in the cases of ties among the $x$'s the different
possible rankings are assigned at random, then for any $F_{\theta_1}$, $F_{\theta_2}$ with
discontinuities there correspond $F^*_{\theta_1}$, $F^*_{\theta_2}$ which are continuous and for
which

(5.1) $\Pr_{F_{\theta_1}, F_{\theta_2}}\{\mathcal{E}\} = \Pr_{F^*_{\theta_1}, F^*_{\theta_2}}\{\mathcal{E}\}$

and $F^*_{\theta_1} = F^*_{\theta_2}$ if and only if $F_{\theta_1} = F_{\theta_2}$.

† Putter [14] has shown that, for large samples using the Wilcoxon two-sample test
or the sign test, the first procedure has a greater efficiency.

Proof. We first give a construction for $F^*_{\theta_1}$, $F^*_{\theta_2}$. Let $a_1, a_2, \ldots$ be
the countable set of points for which $F_{\theta_1}$ or $F_{\theta_2}$ has discontinuities.
Construct from $F_{\theta_1}$, $F_{\theta_2}$ distribution functions $F^{(1)}_{\theta_1}$, $F^{(1)}_{\theta_2}$ by separating the
probability measures at $a_1$ by an amount 1/2 (1/4 in each direction) and
replacing the probability discontinuity by the same probability uniformly
distributed over the gap. Similarly define $F^{(2)}_{\theta_1}$, $F^{(2)}_{\theta_2}$ in terms of $F^{(1)}_{\theta_1}$, $F^{(1)}_{\theta_2}$
by separating the measures at $a_2$ by an amount $1/2^2$ and replacing the
probability discontinuity by a uniform distribution over the gap. It is
easily seen that the sequences $(F^{(1)}_{\theta_1}, F^{(2)}_{\theta_1}, \ldots)$ and $(F^{(1)}_{\theta_2}, F^{(2)}_{\theta_2}, \ldots)$ must
converge to distribution functions, say $F^*_{\theta_1}$, $F^*_{\theta_2}$. The equal-probability
assignment of ranks at discontinuities then is equivalent to the equal-probability
assignment obtained from the corresponding uniform distributions.
This establishes (5.1). The remainder of the theorem follows
immediately from the construction procedure.


6. PROBLEMS FOR SOLUTION 


1. For the two-sample problem as given for Theorem 1.1., show that the Pitman test 
for Slippage of the second sample to the right is unbiased. This test can be described as 
a conditional test, given the order statistic for the combined sample ¢(x) = (tar ***, 
Pmisng). Under the hypothesis that F(x) = G(x), each permutation of (ti * * Pineng) 


as the same probability 1/(, + 12)! of being the outcome. Under this conditional 
ii ni 


# =n," > x, and &” = 


distributi z oe 
‘Stribution the Pitman statistic f(x) = * 
z 1 


ae > tas has an induced distribution. Let fabe the point exceeded with probability 
I 

« ; e eres 

st according to this distribution; fz W 

Statistic (x), The test is obtained by rejecting t 


IS greater than fy. T ae i 
e ‘,. To get exact size, in genera > 
the Pitman test is a valid unbiased test for the extended problem as given by (1.3). 


2. For the two-sample scale problem as given in Section (3.4) of Chapter 3, show that 
ne following test proposed by Lehmann is unbiased (use the criterion given in Section 
l of this chapter). Let W be the proportion of quadruples (VaV; nye Taps y) for which 

“nisi — %q..4| >] ay — vi, and let Wy be the value exceeded with probability œ 

ner the equally likely permutation distribution of (xy. t+ Tieng») Under the hypo- 
esis. The test is to reject the hypothesis if the observed value of W exceeds Wg. 
randomized test is needed in general to obtain exact size %. 


ill of course depend on the value of the order 
he hypothesis if the observed value of f(x) 
la randomized test is needed. Show that 
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3. Consider the two-sample problem (3.1) for k-variate distributions as given in 
Section 3.1 of Chapter 3. Construct an unbiased test on the basis of the following 
suggestions. As a parameter to discriminate between the hypothesis and alternative, 
consider 


“lf 


where 


The problem is to test the hypothesis p, = ps against the alternative pı > pe. If we can 
find events for p, and pz, this binomial problem has of course an unbiased similar test. 
(See Problem 33 in Chapter 2.) For Pı and pa, consider, respectively, the events A and 
B. With probability 1/2 observe either xı, Xs Or X),.1, Xn,+2 and with probability 1/2 
observe either xs or X,,,,3. Denote the three outcomes by Zi, Zs, Zy and define A as the 
event 

Zis Zan S 2%, for s=1,-++,k, 


Observe x4, Xn,+4 and with probability } either x, or Xn,+s» Designate this last outcome 
by z;, and define B as the event 


Lass Caras S Zs for s= 1, sk 

4. Consider the problem of independence, (2.4) in Chapter 3. The independence of 
the coordinates X}", X{*) is equivalent to (X, X{) having the same distribution as 
xg”, xy". Using the two-sample unbiased test in the previous question, define an 
unbiased test based on a sample of for the problem of independence. 
; 5. Let Xi, +++, Xn be independent and each have the distribution Po on R* where 0 
indexes the absolutely continuous distributions. Construct an unbiased test of the 
hypothesis that the distribution is symmetric about the origin against the alternative 
of asymmetry. (See Section 2.3 in Chapter 3.) Consider the following suggestions. 
For four 2's, say 2, ta, wy, £4, define the events: 
A: Exactly two of the 2’s are positive. 
B: If Ais satisfied and x;, x; <0 < 2,, 2, then B occurs if either 2j, Ty < Ep Vi OF 

Uj, Uy > Ly, Ty. 

Show that the maximum probability for AB is 1/4 and corresponds to symmetry about 
the origin. 

6. For the bivariate Problem 2.6 in Section 2.1 show that the sign test based on 
af?) — ah, oo, g _ alld ig a uniformly most powerful test. 

7. Show that the statistic $t(x_1, \ldots, x_n)$ is a sufficient statistic (ps) for the two-sample
location problem as formulated at the end of Section 2.1. By applying Theorem 3.4
in Chapter 2, prove that the sign test is uniformly most powerful for the one-sided
location problem.

8. Prove that an unbiased test for the two-sided location problem, (2.6) in Chapter 3,
is a similar test by using the method of proof found in Theorem 3.5 of Chapter 2. Then
use the theorem in [2] to derive the uniformly most powerful unbiased test.

9. For the two-sided location parameter problem based on the median, show that the
two-sided test is most stringent. Use Theorems 3.9 and 3.10 in Chapter 2 and the
technique of Section 2.1 in this chapter.

10. Apply the theory of Section 2.2 to the problem of randomness with regression
alternative, (3.3) in Chapter 3. Consider a simple alternative for which the 'error' has
a normal distribution, and show that the most powerful similar test is to reject for large
values of $\sum (x_i - \bar{x})(c_i - \bar{c})$ in each subspace $t(x_1, \ldots, x_n) = \{x_1, \ldots, x_n\} = t$.
What is the most stringent similar test for the two-sided problem (3.4) with normal
alternative?

11. Apply the theory of Section 2.2 to the problem of location as given with the
assumption of symmetry, (2.7) with Assumption 2.a in Chapter 3. Consider a simple
alternative for which each $X_i$ has the same normal distribution with positive mean $\xi$,
and show that the most powerful similar test is to reject for large values of $\sum x_i$, given
the statistic $t(x) = \{|x_1|, \ldots, |x_n|\}$. What is the most stringent similar test for the
two-sided problem with normal alternative, (2.8) of Chapter 3? To show completeness,
note that $t(x)$ is the order statistic for an arbitrary absolutely continuous distribution on
$[0, \infty[$. Theorems 7.1 and 7.3 in Chapter 1 extend to cover this case.

12. Apply the theory in Section 2.2 to the randomized-block problem, (4.3) in Chapter
3, with $r = 2$. Consider a simple alternative for which $X_{11}, X_{12}, \ldots, X_{c1}, X_{c2}$ are
independent and each has a normal distribution with variance $\sigma^2$, the $X_{i1}$ with mean $\mu_1$,
and the $X_{i2}$ with mean $\mu_2$ ($\mu_2 > \mu_1$, say). Show that the most powerful similar test is to
reject for large values of $\sum (x_{i2} - x_{i1})$ given the statistic $t(x) = \{\{x_{11}, x_{12}\}, \ldots, \{x_{c1}, x_{c2}\}\}$.
What is the most stringent similar test for the two-sided problem ($\mu_1 \ne \mu_2$)
with the normal alternative having the same variance for each coordinate?

13. Find the most powerful similar and most stringent similar tests for Problem 12
with formulation (4.3) replaced by (4.4) of Chapter 3. For this consider the statistic

$$t(x) = (\{x_{11}, x_{12}\}, \ldots, \{x_{c1}, x_{c2}\}).$$

14. Define a class of probability measures over $R^n$ and a class of transformations
such that the set of ranks $r(x)$ defined by (3.10) in Section 3.3 is maximal-invariant.

15. For the two-sample problem the Mann-Whitney test in Section 1 was based on $V$,
the proportion of pairs $x_i, x_{n_1+j}$ having $x_i < x_{n_1+j}$ ($i = 1, \ldots, n_1$; $j = 1, \ldots, n_2$), and
in Section 3.3 was based on $\sum s_j$, the sum of the ranks in the second sample. Prove
that

$$V = \frac{1}{n_1 n_2}\left[\sum_{j=1}^{n_2} s_j - \frac{n_2(n_2 + 1)}{2}\right].$$

16. Consider the randomness problem with regression alternative (3.3) in Chapter 3.
Show that, against a normal alternative, the rank test most powerful for $\xi$ small and
positive is to reject the hypothesis for large values of the statistic

$$t(r) = \sum_{i=1}^{n} (c_i - \bar{c})\,E\{Z_{r_i n}\},$$

where $Z_{1n}, \ldots, Z_{nn}$ are the order-statistic random variables for the unit normal.

17. Consider the single-sample problem of location and symmetry (2.9) in Chapter 3,
and for simplicity take $\xi_0 = 0$. The problem is to test the hypothesis of symmetry
about the origin. Let $F_X(0) = p_0$, and denote by $F_1$ and $F_2$ the conditional distributions
of $-X$, given that $X < 0$, and of $X$, given $X > 0$. Then the hypothesis is equivalent to
$p_0 = 1/2$, $F_1 = F_2$. Let $n_1(x)$, $n_2(x)$ be the number of negative, positive $x$'s in the sample
and divide the sample of $n$ $x$'s into the $n_1$ $x$'s originally negative with their signs changed
to make them positive and the $n_2$ $x$'s originally positive. The problem is treated as a
binomial problem to test $p_0 = 1/2$, combined with a two-sample problem to test $F_1 = F_2$.
In the notation of Example 3.1 with $F_2(x) = g(F_1(x))$, show that

$$\Pr\{\text{The number of } X\text{'s} > 0 \text{ is } n_2, \text{ and } S_1 = s_1, \ldots, S_{n_2} = s_{n_2}\}
= p_0^{n_1}(1 - p_0)^{n_2}\,E\{g'(U_{(s_1)}) \cdots g'(U_{(s_{n_2})})\}.$$

For the alternative for which $F_2 = qF_1 + pF_1^2$ ($0 < p < 1$, $p + q = 1$), show that the
rank test which maximizes the power for $p$ small and $p_0 = 1/2$ is to reject when
$s_1 + \cdots + s_{n_2} > c$, and $c$ is chosen to give the test size $\alpha$ under the hypothesis. This test was
originally proposed by Wilcoxon [12]. The results in this example indicate that the
test is sensitive toward slippage to the right of the positive axis distribution under the
assumption the median remains at the origin.

18. For the problem of independence as formulated in Section 3.2 of this chapter,
show by Hoeffding's theorem that, under the alternative (3.8), the probability for the
rank statistic is given by

$$\Pr\{R_1^{(1)} = r_1^{(1)}, \ldots, R_n^{(1)} = r_n^{(1)}, R_1^{(2)} = r_1^{(2)}, \ldots, R_n^{(2)} = r_n^{(2)}\}
= \frac{1}{(n!)^2}\,E\{h'(U_{(r_1^{(1)})}, V_{(r_1^{(2)})}) \cdots h'(U_{(r_n^{(1)})}, V_{(r_n^{(2)})})\},$$

where $h'(u, v) = (\partial^2/\partial u\,\partial v)\,h(u, v)$ and $U_1, \ldots, U_n$; $V_1, \ldots, V_n$ are two independent
samples of $n$ from the uniform distribution $[0, 1]$. Consider the alternatives for which
$h(u, v) = quv + pu^2v^2$ ($0 < p < 1$, $p + q = 1$). Show that the rank test most powerful
for $p$ small rejects when the statistic $\sum r_i^{(1)} r_i^{(2)}$ is too large. This is the one-sided rank
correlation test.

19. For the problem of independence (previous question), show that the rank test
locally most powerful against the normal alternative with small positive $\rho$ is to reject
for large values of the statistic

$$\sum_{i=1}^{n} E\{Z_{r_i^{(1)} n}\}\,E\{Z_{r_i^{(2)} n}\},$$

where $Z_{1n}, \ldots, Z_{nn}$ are the order-statistic random variables for a sample of $n$ from the
normal distribution with mean 0 and variance 1. For the two-sided normal alternative
with $\rho$ small, find the rank test locally most stringent.
20. For Example 3.2 find the rank test that is locally most stringent. 
21. Show that, for the two-sample problem (3.1) in Chapter 3, the modified likelihood-ratio
statistic $L^*(x)$ is a constant equal to

$$\frac{(n_1 + n_2)^{n_1 + n_2}}{n_1^{n_1}\, n_2^{n_2}}.$$
22. Construct an example having a simple hypothesis and simple alternative for which
the likelihood-ratio method does not produce the most powerful test. For simplicity
choose two distributions on the real line.

23. Construct the Wolfowitz likelihood-ratio test for the problem of independence.
For the parametric alternative, consider a distribution having a functional relationship
between $x^{(1)}$ and $x^{(2)}$, a relationship that is one-to-one, and by sections linear, and that
transforms the given ordering of the $x^{(1)}$'s into the given ordering for the $x^{(2)}$'s.

CHAPTER 6


Limiting Distributions 


1. INTRODUCTION 


In Chapter 5 we developed methods for constructing nonparametric
tests. For all the examples considered, the tests were or could be put in
the form

$$\phi(x) = \begin{cases} 1 & \text{if } t(x) > c \\ a & \text{if } t(x) = c \\ 0 & \text{if } t(x) < c, \end{cases}$$

where $t(x)$ is a real-valued statistic. In order to use such a test, the constants
$c$, $a$ must be chosen to give the test the correct size. This necessitates
a knowledge of the induced distribution of the statistic $t(x)$ under the
hypothesis. Also, to examine the power of the test we need the induced
distribution of $t(x)$ corresponding to each parameter value of the alternative.
For many problems these distributions are quite complicated and
require excessive computation to tabulate them. However, in cases where
the outcome corresponds to a sample of $n$ from a distribution over a
component space, it often happens that, as $n$ becomes large, the distribution
of the statistic $t(x)$ approaches some simple distribution such as the
normal or $\chi^2$ distribution. In this chapter we consider theorems
concerning the approach of a distribution to a limiting form.

A number of theorems are standard theorems of probability theory with 
applications in many branches of statistics. We quote these theorems 
without proof, giving references where the proofs are available. However, 
others were developed primarily for nonparametric application, and we 
give the proofs for most of these.

2. GENERAL THEOREMS CONCERNING LIMITING DISTRIBUTIONS


As indicated above, we are interested in the distribution of a real-valued
statistic and how it changes as a known parameter $n$, in most cases the
sample size, becomes large. In particular, we are interested in the probability
to one side of a real number. For this it is convenient to use
distribution functions. Therefore, consider a sequence of distribution
functions $\{F_n(x);\ n = 1, 2, \ldots\}$.

A sequence of distribution functions $\{F_n(x)\}$ is said to converge to a
distribution function $F(x)$ if for each point $x$ at which $F(x)$ is continuous

$$\lim_{n \to \infty} F_n(x) = F(x). \tag{2.1}$$

If a random variable has the sequence of distribution functions $\{F_n(x)\}$
satisfying the definition, then we say the random variable has asymptotically
the distribution given by $F(x)$.
The reason we require the limit to hold only at points of continuity is
best illustrated by an example. Let $F_n(x)$ be the distribution function of
the random variable which takes the value $1/n$ with probability one:

$$F_n(x) = \begin{cases} 1 & \text{if } x \ge 1/n \\ 0 & \text{if } x < 1/n. \end{cases}$$

As $n \to \infty$, the probability distribution becomes close to the probability
distribution having all probability at $x = 0$; it has distribution function

$$F(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}$$

It is easily seen that $\lim F_n(x) = F(x)$ for all $x$ other than 0, the point of
discontinuity of $F(x)$; at that point we have

$$\lim_{n \to \infty} F_n(0) = 0 \ne F(0).$$

Thus, if we were to define convergence on the basis of the limit holding for
all $x$, we would exclude this simple case. However, when the probability
distribution has discrete probabilities, this is the sort of 'convergence' we
want to consider, and hence we require the limit to hold only at points of
continuity.
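The point-mass example can be checked numerically. The following is a minimal sketch; the function names and the use of a single large $n$ as a stand-in for the limit are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def F_n(x, n):
    """Distribution function of the variable equal to 1/n with probability one."""
    return 1 if x >= Fraction(1, n) else 0

def F(x):
    """Distribution function of the variable equal to 0 with probability one."""
    return 1 if x >= 0 else 0

def limit_value(x, n_large=10**6):
    """F_n(x) at one large n, used here as a stand-in for lim F_n(x) (an assumption)."""
    return F_n(x, n_large)

# Compare the stand-in limit with F at a few points, including the discontinuity 0.
for x in (Fraction(-1, 2), Fraction(0), Fraction(1, 100), Fraction(1)):
    print(x, limit_value(x), F(x))
```

At every point other than 0 the two columns agree; at 0 the sequence stays at 0 while $F(0) = 1$, which is exactly the case the continuity-point restriction is designed to allow.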
The definition of convergence does not in general imply that

$$\lim_{n \to \infty} \int_A dF_n(x) = \int_A dF(x).$$

To require that this limit hold for all Borel sets $A$ on the real line produces a
much stronger definition of convergence than introduced above. For an
interesting discussion of definitions of convergence and relations between
them, see Scheffé [13].

An example of the convergence of distribution functions is obtained from
the concept of convergence in probability of random variables. A
sequence of independent random variables $\{X_n\}$ is said to converge in
probability to a constant $c$ and is written

$$\operatorname*{p-lim}_{n \to \infty} X_n = c \tag{2.2}$$

if the probability in any neighborhood of $c$ approaches 1 as $n$ approaches
infinity: that is, if, for all $\delta > 0$,

$$\lim_{n \to \infty} \Pr\{c - \delta < X_n < c + \delta\} = 1. \tag{2.3}$$

It is easy to show that this is equivalent to the convergence of the corresponding
distribution functions to the distribution function of the random
variable taking the value $c$ with probability one (Problem 2). Also, it is
easy to show by Tchebycheff's inequality that, if $E\{X_n\} \to c$ and $\sigma_{X_n} \to 0$,
then p-lim $X_n = c$ (Problem 3).

Our first theorem relates the convergence of distribution functions to the
convergence of the corresponding moments. The moments of the distribution
functions $F_n(x)$ are defined by

$$\mu_r^{(n)} = \int_{-\infty}^{+\infty} x^r\, dF_n(x) \tag{2.4}$$

for $r = 1, 2, \ldots$. When the integral does not converge, we say the
corresponding moment does not exist.

THEOREM 2.1. (FRÉCHET AND SHOHAT). If for each $n$ the moments
$\{\mu_r^{(n)};\ r = 1, 2, \ldots\}$ of $F_n(x)$ exist and if $\lim_{n \to \infty} \mu_r^{(n)} = \mu_r$ $(r = 1, 2, \ldots)$,
then the $\mu_r$ are the moments of a distribution function. Also, if there is
only one distribution function $F(x)$ having the moments $\mu_r$, then the
distribution function $F_n(x)$ converges to $F(x)$ as $n \to \infty$.

Proof. See Kendall [14], p. 110.
To apply this theorem we need to know whether a set of moments
determines a distribution uniquely. The next theorem provides a criterion
which, if satisfied, establishes the uniqueness of the distribution.

THEOREM 2.2. If $F(x)$ is a distribution function with moments
$\{\mu_r;\ r = 1, 2, \ldots\}$, then the absolute convergence of the series

$$\sum_{r=1}^{\infty} \frac{\mu_r}{r!}\, z^r \tag{2.5}$$

for some $z > 0$ implies that $F(x)$ is the only distribution function having
moments $\{\mu_r\}$.

Proof. See Cramér [15], p. 176.

It is quite easy to show that the moments of any normal distribution
satisfy this criterion and hence that the convergence of moments to normal
moments implies a limiting normal distribution (Problem 1).

There is a multivariate analog to each of these theorems. Let
$F_n(x_1, \ldots, x_k)$ be a distribution function over $R^k$. The moments are
defined by

$$\mu_{r_1 \cdots r_k}^{(n)} = \int x_1^{r_1} \cdots x_k^{r_k}\, dF_n(x_1, \ldots, x_k) \tag{2.6}$$

for $r_1, \ldots, r_k = 0, 1, 2, \ldots$. If one of the integrals does not converge, we
say that the corresponding moment does not exist. Theorem 2.1 using the
sequence of moments $\{\mu_{r_1 \cdots r_k}^{(n)};\ r_1, \ldots, r_k = 0, 1, 2, \ldots\}$ can be shown to apply
equally to this more general case. There is an extension of Theorem
2.2, but for our purposes it suffices to note that the moments of a multivariate
normal distribution satisfy the criterion and hence determine the
distribution uniquely.
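The criterion of Theorem 2.2 can be illustrated for the unit normal, whose moments are 0 for odd $r$ and $(r-1)(r-3)\cdots 3 \cdot 1$ for even $r$. The sketch below is only a numerical illustration under that assumption (the function names are hypothetical): it sums the series (2.5) exactly with rational arithmetic and compares the partial sum with $e^{z^2/2} - 1$, the value the series converges to in this case.

```python
import math
from fractions import Fraction

def normal_moment(r):
    """r-th moment of the unit normal: 0 for odd r, (r-1)(r-3)...3*1 for even r."""
    if r % 2 == 1:
        return 0
    return math.prod(range(1, r, 2))

def partial_sum(z, terms):
    """Exact partial sum of series (2.5), sum over r = 1..terms of mu_r z^r / r!."""
    z = Fraction(z)
    return sum(Fraction(normal_moment(r), math.factorial(r)) * z**r
               for r in range(1, terms + 1))

# For the unit normal the series equals exp(z^2 / 2) - 1 for every z.
print(float(partial_sum(2, 60)), math.exp(2.0) - 1.0)
```

Since the partial sums settle down for $z = 2$ (and in fact for every $z$), the criterion is satisfied in this case, in line with the remark that normal moments determine the distribution.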

The next theorem relates the convergence of distributions to the convergence
of the corresponding characteristic functions. The characteristic
function of a distribution $F_n(x)$ is defined by

$$\phi_n(t) = \int_{-\infty}^{+\infty} e^{itx}\, dF_n(x) \tag{2.7}$$

for real $t$, and is in general a complex-valued function. The characteristic
function always exists since the integrand is bounded: $|e^{itx}| = 1$. We
have the following uniqueness theorem for characteristic functions.

THEOREM 2.3. (LÉVY). If two distribution functions have the same
characteristic function, then they are identical.

Proof. See Cramér [15], p. 93.

This theorem shows that there is a one-to-one correspondence between
distribution functions and characteristic functions. The next theorem
shows that in a certain sense this correspondence is continuous.

THEOREM 2.4 (LÉVY AND CRAMÉR). For the sequence of distribution
functions $F_1(x), F_2(x), \ldots$ with corresponding characteristic functions
$\phi_1(t), \phi_2(t), \ldots$, a necessary and sufficient condition that the distribution
functions $F_n(x)$ converge to a distribution function $F(x)$ is that, for every
real $t$, the sequence $\phi_n(t)$ converges to a limit $\phi(t)$ which is continuous at
$t = 0$. $\phi(t)$ is the characteristic function of $F(x)$.

Proof. See Cramér [15], p. 96.
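As an illustration of Theorem 2.4, consider the sum of $n$ independent variables taking the values $+1$ and $-1$ with probability 1/2 each, divided by $n^{1/2}$. Its characteristic function is $\cos^n(t/n^{1/2})$, and the sketch below compares it with $e^{-t^2/2}$, the characteristic function of the unit normal. The example and the function names are assumptions chosen for illustration; they do not come from the text.

```python
import math

def cf_standardized_sign_sum(t, n):
    """Characteristic function of n^{-1/2} (X_1 + ... + X_n), X_i = +1 or -1
    with probability 1/2 each; each X_i has characteristic function cos(t)."""
    return math.cos(t / math.sqrt(n)) ** n

def cf_unit_normal(t):
    """Characteristic function of the normal distribution with mean 0, variance 1."""
    return math.exp(-t * t / 2.0)

# The gap at a fixed t shrinks as n grows.
for n in (10, 1000, 10**6):
    print(n, abs(cf_standardized_sign_sum(1.5, n) - cf_unit_normal(1.5)))
```

The limit $e^{-t^2/2}$ is continuous at $t = 0$, so under the theorem the distribution functions converge to the unit normal distribution function.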


For multivariate distribution functions $F_n(x_1, \ldots, x_k)$ we can define a
characteristic function,

$$\phi_n(t_1, \ldots, t_k) = \int \exp\Big(i \sum_{j=1}^{k} t_j x_j\Big)\, dF_n(x_1, \ldots, x_k) \tag{2.8}$$

for $(t_1, \ldots, t_k) \in R^k$. With this definition Theorems 2.3 and 2.4 are valid
for multivariate distributions.

We quote now a theorem which does not directly concern limiting 
distributions but which can frequently be used with earlier theorems in 
this section to find the form of a limiting distribution. 


THEOREM 2.5. (CRAMÉR-WOLD). If the distribution of $\sum_1^k l_i X_i$ is
identical to the distribution of $\sum_1^k l_i Y_i$ for all $(l_1, \ldots, l_k) \in R^k$, then the
distribution of $(X_1, \ldots, X_k)$ is identical to the distribution of $(Y_1, \ldots, Y_k)$.

Proof. See Cramér [20], p. 105.


To complete this section we have three theorems which enable us to 
derive from the limiting distribution of one random variable the limiting 
distributions of related random variables. The first theorem is given for 
real-valued random variables, but it has an obvious multivariate analog. 


THEOREM 2.6. If $(X_1, Y_1), (X_2, Y_2), \ldots$ is a sequence of random
variables such that the sequence $X_1, X_2, \ldots$ has limiting distribution
$F(x)$ and the sequence $Y_1, Y_2, \ldots$ converges in probability to a constant
$c$, then $X_n + Y_n$ has limiting distribution $F(x - c)$. Further, if $c > 0$,
then $X_n Y_n$ has limiting distribution $F(x/c)$ and $X_n/Y_n$ has limiting
distribution $F(xc)$.

Proof. See Cramér [15], p. 254.
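Theorem 2.6 can be illustrated by simulation. In the sketch below, which is only an assumed example and not the book's, $X_n$ is a standardized mean of $n$ uniform variables (so its limiting distribution is the unit normal $F$), and $Y_n = c + \text{noise}/n$ converges in probability to $c = 2$. The seed, sample sizes and names are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def simulate(reps=4000, n=200, c=2.0, seed=12345):
    """Return simulated values of X_n + Y_n and X_n * Y_n.

    X_n = (12 n)^{1/2} (mean of n uniforms - 1/2) is approximately unit normal;
    Y_n = c + (unit normal)/n converges in probability to c.
    """
    rng = random.Random(seed)
    sums, products = [], []
    for _ in range(reps):
        mean_u = sum(rng.random() for _ in range(n)) / n
        x_n = math.sqrt(12 * n) * (mean_u - 0.5)
        y_n = c + rng.gauss(0.0, 1.0) / n
        sums.append(x_n + y_n)
        products.append(x_n * y_n)
    return sums, products

def empirical_df(values, x):
    """Proportion of simulated values not exceeding x."""
    return sum(v <= x for v in values) / len(values)

F = NormalDist().cdf  # limiting distribution function of X_n

sums, products = simulate()
print(empirical_df(sums, 2.5), F(2.5 - 2.0))      # compare with F(x - c)
print(empirical_df(products, 1.0), F(1.0 / 2.0))  # compare with F(x / c)
```

The two printed pairs should be close to each other, up to simulation error of a few hundredths.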


THEOREM 2.7. If $(X_1, Y_1), (X_2, Y_2), \ldots$ is a sequence of random variables
over $R^2$ which has the limiting distribution of a random variable
$(X, Y)$ with mean $(0, 0)$, if $d_n$ is a sequence of real numbers with $\lim_{n \to \infty} d_n = 0$,
and if $h(x, y)$ is a real function having a total differential at $(0, 0)$, then, as
$n \to \infty$, the distribution of

$$d_n^{-1}\,[h(d_n X_n, d_n Y_n) - h(0, 0)] \tag{2.9}$$

converges to the distribution of $h_1 X + h_2 Y$ where

$$h_1 = \frac{\partial h(x, y)}{\partial x}\bigg|_{(0,0)}, \qquad h_2 = \frac{\partial h(x, y)}{\partial y}\bigg|_{(0,0)}. \tag{2.10}$$


Proof. The proof is essentially that of Cramér’s theorem on page 366 in 


THEOREM 2.8. If $(X_1, Y_1), (X_2, Y_2), \ldots$ is a sequence of independent
random variables which has the limiting distribution of a random variable
$(X, Y)$, and if $f(x, y)$ is a continuous function, then the limiting distribution
of $f(X_n, Y_n)$ exists and is the distribution of $f(X, Y)$.

Proof. Let $b$ be a point of continuity for the distribution function of
$f(X, Y)$. We shall prove that

$$\lim_{n \to \infty} \Pr\{f(X_n, Y_n) \le b\} = \Pr\{f(X, Y) \le b\}, \tag{2.11}$$

and this establishes the theorem.

Let $F_n(x, y)$ be the distribution function of $(X_n, Y_n)$ and $F(x, y)$ be the
distribution of $(X, Y)$. Cramér's theorem (on page 74 [15]) establishes
that

$$\lim_{n \to \infty} \int_{R^2} h(x, y)\, dF_n(x, y) = \int_{R^2} h(x, y)\, dF(x, y)$$

for bounded continuous functions $h(x, y)$. For positive $\epsilon$, we define two
functions $h_\epsilon^+(x, y)$, $h_\epsilon^-(x, y)$:

$$h_\epsilon^+(x, y) = \begin{cases} 1 & \text{if } f(x, y) \le b \\ \dfrac{b + \epsilon - f(x, y)}{\epsilon} & \text{if } b < f(x, y) \le b + \epsilon \\ 0 & \text{if } b + \epsilon < f(x, y), \end{cases}$$

$$h_\epsilon^-(x, y) = \begin{cases} 1 & \text{if } f(x, y) \le b - \epsilon \\ \dfrac{b - f(x, y)}{\epsilon} & \text{if } b - \epsilon < f(x, y) \le b \\ 0 & \text{if } b < f(x, y). \end{cases}$$

Since these functions are continuous and bounded, we have

$$\lim_{n \to \infty} \int_{R^2} h_\epsilon^+(x, y)\, dF_n(x, y) = \int_{R^2} h_\epsilon^+(x, y)\, dF(x, y), \tag{2.12}$$

$$\lim_{n \to \infty} \int_{R^2} h_\epsilon^-(x, y)\, dF_n(x, y) = \int_{R^2} h_\epsilon^-(x, y)\, dF(x, y). \tag{2.13}$$

But we have

$$\int_{R^2} h_\epsilon^-(x, y)\, dF_n(x, y) \le \Pr\{f(X_n, Y_n) \le b\} \le \int_{R^2} h_\epsilon^+(x, y)\, dF_n(x, y)$$

and

$$\int_{R^2} h_\epsilon^-(x, y)\, dF(x, y) \le \Pr\{f(X, Y) \le b\} \le \int_{R^2} h_\epsilon^+(x, y)\, dF(x, y).$$

Therefore the limit points of $\Pr\{f(X_n, Y_n) \le b\}$ are contained between
the two values (2.12) and (2.13). However, by virtue of our choice of $b$
as a continuity point of the distribution of $f(X, Y)$, it follows that (2.12)
and (2.13) can be made arbitrarily close to $\Pr\{f(X, Y) \le b\}$ by choice of
$\epsilon$ small. This establishes (2.11) and hence the theorem.

Theorems 2.7 and 2.8 are stated for bivariate distributions but they
remain valid for multivariate distributions.


3. CENTRAL LIMIT THEOREMS 


In this section we consider a number of theorems which are central to
the theory of limiting distributions: the central-limit theorems. However,
we first quote a related theorem mentioned in Chapter 2.

THEOREM 3.1. (KHINTCHINE). If $X_1, X_2, \ldots$ are independent random
variables, each with the same distribution function $F(x)$, and if the mean
$\mu'$ of $F(x)$ exists,

$$\mu' = \int x\, dF(x),$$

then $\bar{X} = n^{-1} \sum_1^n X_i$ converges in probability to $\mu'$ as $n \to \infty$.

Proof. See Cramér [15], p. 254.


This theorem states that, if we pick a sufficiently large value of $n$ and
examine probability statements, then most of the probability for $\bar{X}$ will be
in a small neighborhood of $\mu'$. In such a case $\bar{X}$ is said to satisfy the weak
law of large numbers. Also of interest is to inquire what happens in a
probability sense to $\bar{X}$ as $n$ increases, and in particular, what is the
probability that

$$\lim_{n \to \infty} n^{-1} \sum_1^n X_i = \mu'.$$

A sequence for which the probability of convergence is one is said to obey
the strong law of large numbers. For this, we have a generalized form of
Khintchine's theorem.

THEOREM 3.2 (KOLMOGOROV). If $X_1, X_2, \ldots$ are independent random
variables, each with the same distribution function $F(x)$, and if the mean
$\mu'$ of $F(x)$ exists,

$$\mu' = \int x\, dF(x),$$

then with probability one

$$\lim_{n \to \infty} n^{-1} \sum_1^n X_i = \mu'.$$

Proof. See Feller [9], p. 208.

If in Theorem 3.1 we assume in addition that the second moment of the
distribution exists, then a stronger statement can be made which gives the
limiting distribution of $\bar{X}$ in the neighborhood of $\mu'$.

THEOREM 3.3. CENTRAL-LIMIT THEOREM (LINDEBERG AND LÉVY). If
$X_1, X_2, \ldots$ are independent random variables each with the same distribution
having mean $\mu'$ and finite variance $\sigma^2$, then

$$n^{-1/2} \sum_{i=1}^{n} (X_i - \mu')$$

is asymptotically normal with mean 0 and variance $\sigma^2$.

Proof. See Cramér [15], p. 214.
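For a concrete check of Theorem 3.3, take $X_i$ equal to 0 or 1 with probability 1/2 each, so that $\mu' = 1/2$ and $\sigma^2 = 1/4$. The distribution of the sum is then binomial and can be computed exactly. The sketch below (an assumed example with hypothetical names) measures the largest gap between the exact distribution function of $n^{-1/2} \sum (X_i - 1/2)$ and the normal distribution function with mean 0 and variance 1/4.

```python
from math import comb, sqrt
from statistics import NormalDist

def sup_distance(n):
    """Largest gap between the distribution function of n^{-1/2} * sum(X_i - 1/2)
    and the normal distribution function with mean 0 and variance 1/4."""
    limit = NormalDist(mu=0.0, sigma=0.5)
    total = 2 ** n
    cumulative = 0   # exact count of outcomes with sum <= k
    worst = 0.0
    for k in range(n + 1):
        x = (k - n / 2) / sqrt(n)
        target = limit.cdf(x)
        below = cumulative / total      # value just to the left of the jump at x
        cumulative += comb(n, k)
        at = cumulative / total         # value at the jump point x
        worst = max(worst, abs(below - target), abs(at - target))
    return worst

for n in (10, 100, 1000):
    print(n, sup_distance(n))
```

Because the exact distribution function is a step function, the largest gap occurs at a jump point, which is why only those points are examined. The gap shrinks as $n$ grows, as the theorem leads one to expect.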


If we assume that third moments exist, then we can drop the assumption
that the $X_i$ have the same distribution; we have

THEOREM 3.4. CENTRAL-LIMIT THEOREM (LIAPOUNOFF). If $X_1, X_2, \ldots$
are independent random variables, if $X_i$ has mean $\mu_i$, variance $\sigma_i^2$, and
third absolute central moment $\rho_i^3$,

$$\rho_i^3 = E\{|X_i - \mu_i|^3\},$$

and if $\lim_{n \to \infty} \rho/\sigma = 0$ where

$$\rho^3 = \sum_{i=1}^{n} \rho_i^3, \qquad \sigma^2 = \sum_{i=1}^{n} \sigma_i^2,$$

then

$$\sigma^{-1} \sum_{i=1}^{n} (X_i - \mu_i)$$

is asymptotically normal with mean 0 and variance 1.

Proof. See Cramér [15], p. 216.

Note. Often we describe such a convergence by saying that $\sum X_i$ is
asymptotically normal with mean $\sum \mu_i$ and variance $\sigma^2$. ($\sigma$ depends on $n$.)

These theorems remain valid if the random variables have a multivariate
distribution over $R^k$. Also, the last theorem remains valid if, for each $n$,
there is a fresh group of random variables rather than the first $n$ from a
given sequence. We illustrate these two extensions by the following
bivariate form of the central-limit theorem.

THEOREM 3.5. CENTRAL-LIMIT THEOREM (BERNSTEIN). If, for each $n$,
$\{(X_{n1}, Y_{n1}), \ldots, (X_{n\nu}, Y_{n\nu})\}$, $\nu = \nu(n)$, is a set of independent random variables
over $R^2$ for which

$$E\{X_{n\alpha}\} = E\{Y_{n\alpha}\} = 0 \qquad (\alpha = 1, \ldots, \nu),$$

and

$$\lim_{n \to \infty} \nu(n) = \infty,$$

$$\lim_{n \to \infty} \nu^{-3/2}(n)\, \rho_n^3 = 0,$$

$$\lim_{n \to \infty} \nu^{-1} \sum_{\alpha=1}^{\nu} \mu_{ij}^{(n\alpha)} = \mu_{ij}$$

for $(i, j) = (2, 0), (1, 1), (0, 2)$ where

$$\mu_{ij}^{(n\alpha)} = E\{X_{n\alpha}^{i} Y_{n\alpha}^{j}\},$$

$$\rho_{n\alpha}^3 = \max\,[E\{|X_{n\alpha}|^3\},\ E\{|Y_{n\alpha}|^3\}],$$

$$\rho_n^3 = \sum_{\alpha=1}^{\nu} \rho_{n\alpha}^3,$$

then

$$\Big(\nu^{-1/2} \sum_{\alpha=1}^{\nu} X_{n\alpha},\ \nu^{-1/2} \sum_{\alpha=1}^{\nu} Y_{n\alpha}\Big)$$

has asymptotically the bivariate normal distribution with means 0 and
second-order moments $\mu_{ij}$.

Proof. See Hoeffding and Robbins [4].

For a version of these theorems having the existence of third absolute
moments replaced by the existence of absolute moments of order $2 + \delta$
for some $\delta > 0$, see Uspensky [16].

4. CENTRAL-LIMIT THEOREM FOR 
DEPENDENT VARIABLES 


In this section we consider two theorems which have somewhat the
form of the central-limit theorems of the last section but which concern
dependent random variables. These theorems were developed with a view
toward nonparametric applications.

Let $X_1, X_2, \ldots$ be a sequence of random variables. We say the
sequence of random variables is $m$-dependent if $(X_1, \ldots, X_r)$ is
independent of $(X_s, X_{s+1}, \ldots)$ provided $s - r > m$. In such a case, if
$m$ or more consecutive $X$'s are removed, the two remaining portions of the
sequence are independent. Also we define

$$A_i = 2 \sum_{j=0}^{m-1} \operatorname{cov}\{X_{i+j}, X_{i+m}\} + \operatorname{var}\{X_{i+m}\},$$

where var, cov designate, respectively, variance, covariance. Then
using this notation, we can state the first theorem.

THEOREM 4.1 (HOEFFDING AND ROBBINS). If (a) an $m$-dependent
sequence $X_1, X_2, \ldots$ satisfies $E\{X_i\} = 0$, $E\{|X_i|^3\} \le R^3 < \infty$ for
$i = 1, 2, \ldots$ and (b) the limit

$$\lim_{p \to \infty} p^{-1} \sum_{h=1}^{p} A_{i+h} = A \tag{4.1}$$

exists uniformly for all $i$, then $\sum_1^n X_i$ is asymptotically normal with mean 0
and variance $nA$.

Proof. Choose an $\alpha$ satisfying $0 < \alpha < 1/4$, and let $k = [n^{\alpha}]$, $\nu = [n/k]$
where the brackets designate the largest integer less than or equal to the
bracketed number. Then we have $n = k\nu + r$ with $0 \le r < k$. Our
method of proof will be to break $\sum_1^n X_i$ into $\nu$ independent groups of
$k - m$ $X$'s and a remainder term, and then apply the univariate form of
the central-limit Theorem 3.5 to show that the sum of these $\nu$ groups has a
limiting normal distribution.

Let

$$S = X_1 + \cdots + X_n = S' + T, \tag{4.2}$$

where

$$S' = U_1 + \cdots + U_\nu \tag{4.3}$$

and

$$U_i = X_{(i-1)k+1} + X_{(i-1)k+2} + \cdots + X_{ik-m},$$

$$T = \sum_{i=1}^{\nu-1} (X_{ik-m+1} + \cdots + X_{ik}) + (X_{\nu k-m+1} + \cdots + X_n).$$

We shall show that $n^{-1/2}S'$ has a limiting normal distribution with mean 0
and variance $A$ and that $n^{-1/2}T$ approaches 0 in probability. The proof is
then completed by applying Theorem 2.6 to show that

$$n^{-1/2}S = n^{-1/2}S' + n^{-1/2}T$$

is asymptotically normal with mean 0 and variance $A$.
First we consider $n^{-1/2}S'$. Since

$$n^{-1/2}S' = \Big(\frac{\nu k}{n}\Big)^{1/2} \nu^{-1/2} \sum_{i=1}^{\nu} k^{-1/2}U_i,$$

and $\nu k/n \to 1$ as $n \to \infty$, it is sufficient, again by Theorem 2.6, to show that

$$\nu^{-1/2} \sum_{1}^{\nu} k^{-1/2}U_i$$

has the limiting normal distribution. Now for $s > m$ we have

$$E\{(X_{i+1} + \cdots + X_{i+s})^2\} = E\{(X_{i+1} + \cdots + X_{i+s-1})^2\} + 2E\{(X_{i+1} + \cdots + X_{i+s-1})X_{i+s}\} + E\{X_{i+s}^2\}$$
$$= E\{(X_{i+1} + \cdots + X_{i+s-1})^2\} + A_{i+s-m},$$

where the second step follows from the fact that $E\{X_i X_{i+j}\} = 0$ for
$j > m$; then by induction

$$E\{(X_{i+1} + \cdots + X_{i+s})^2\} = E\{(X_{i+1} + \cdots + X_{i+m})^2\} + \sum_{h=1}^{s-m} A_{i+h}.$$

Applying this to find the variance of $U_i$, we have

$$E\{U_i^2\} = E\{(X_{(i-1)k+1} + \cdots + X_{(i-1)k+m})^2\} + \sum_{h=1}^{k-2m} A_{(i-1)k+h}.$$

But, since $y^{3/2}$ is a convex function, we have by Theorem 2.4 in Chapter 2,

$$E\{|X_i|^3\} = E\{(X_i^2)^{3/2}\} \ge [E\{X_i^2\}]^{3/2},$$

and hence

$$E\{X_i^2\} \le [E\{|X_i|^3\}]^{2/3} \le R^2.$$

Also by the Hölder inequality, page 238 in Monroe [17], we have

$$E\{|X_i X_j|\} \le [E\{X_i^2\} \cdot E\{X_j^2\}]^{1/2} \le R^2.$$

Therefore

$$E\{(X_{(i-1)k+1} + \cdots + X_{(i-1)k+m})^2\} \le m^2R^2,$$

and

$$\Big| E\{U_i^2\} - \sum_{h=1}^{k-2m} A_{(i-1)k+h} \Big| \le m^2R^2.$$

By summing for $i = 1, \ldots, \nu$ and dividing by $\nu$ and by $k$, we obtain

$$\Big| \frac{1}{\nu} \sum_{i=1}^{\nu} E\{(k^{-1/2}U_i)^2\} - \frac{k - 2m}{k} \cdot \frac{1}{\nu} \sum_{i=1}^{\nu} \frac{1}{k - 2m} \sum_{h=1}^{k-2m} A_{(i-1)k+h} \Big| \le \frac{m^2R^2}{k}.$$

As $n \to \infty$, $k \to \infty$, and from (4.1) it follows that

$$\frac{1}{k - 2m} \sum_{h=1}^{k-2m} A_{(i-1)k+h}$$

approaches $A$ uniformly; therefore

$$\lim_{n \to \infty} \frac{1}{\nu} \sum_{i=1}^{\nu} E\{(k^{-1/2}U_i)^2\} = A \ge 0. \tag{4.4}$$

Now $k^{-1/2}U_1, \ldots, k^{-1/2}U_\nu$ are independent and have a distribution
depending on $n$. For the application of Theorem 3.5 we have shown above
that the average variance approaches a finite limit $A$. The third absolute
moment condition will be satisfied if we show that

$$\max\,[E\{|k^{-1/2}U_1|^3\}, \ldots, E\{|k^{-1/2}U_\nu|^3\}] = o(\nu^{1/2}). \tag{4.5}$$

Applying Hölder's inequality again, we have

$$E\{|X_i X_j X_l|\} \le [E\{|X_i|^3\}]^{1/3}\,[E\{|X_j X_l|^{3/2}\}]^{2/3}$$
$$\le [E\{|X_i|^3\}\, E\{|X_j|^3\}\, E\{|X_l|^3\}]^{1/3}$$
$$\le R^3.$$

Then

$$E\{|k^{-1/2}U_i|^3\} = k^{-3/2}\, E\Big\{\Big|\sum_{h=1}^{k-m} X_{(i-1)k+h}\Big|^3\Big\} \le k^{-3/2}(k - m)^3R^3 \le k^{3/2}R^3. \tag{4.6}$$

From our definition of $k$, we have

$$k \sim n^{\alpha}, \qquad n \sim \nu k;$$

therefore

$$k \sim \nu^{\alpha/(1-\alpha)} = o(\nu^{1/3})$$

and

$$k^{3/2} = o(\nu^{1/2}).$$

This, with (4.6), proves the third absolute moment condition (4.5).
Theorem 3.5 then establishes that

$$\nu^{-1/2} \sum_{1}^{\nu} k^{-1/2}U_i,$$

and hence $n^{-1/2}S'$, are asymptotically normal with mean 0 and variance $A$.
Next we consider $n^{-1/2}T$. For $k > 2m$ the brackets in the definition of
$T$ are independent. Therefore, for large $n$,

$$n^{-1}E\{T^2\} = n^{-1}\Big[\sum_{i=1}^{\nu-1} E\{(X_{ik-m+1} + \cdots + X_{ik})^2\} + E\{(X_{\nu k-m+1} + \cdots + X_n)^2\}\Big]$$
$$\le n^{-1}\,[(\nu - 1)m^2R^2 + (k + m)^2R^2].$$

From the definition of $k$ and $\nu$,

$$k \sim n^{\alpha}, \qquad \nu \sim \frac{n}{k} \sim n^{1-\alpha};$$

hence

$$n^{-1}E\{T^2\} = O(\nu/n) + O(k^2/n) = O(n^{-\alpha}) + O(n^{2\alpha-1}) = o(1).$$

The mean of $n^{-1/2}T$ is obviously zero. The variance, $n^{-1}E\{T^2\}$, as we have
just shown, approaches zero. Hence $n^{-1/2}T$ approaches zero in probability.
This completes the proof.
A sequence of random variables, $X_1, X_2, \ldots$, is called stationary if the
joint distribution of $X_i, X_{i+1}, \ldots, X_{i+r}$ is independent of $i$ for all $r$. This
gives the following simplification of Theorem 4.1.

THEOREM 4.2. If $X_1, X_2, \ldots$ is a stationary $m$-dependent sequence of
random variables with $E\{X_1\} = \mu$ and $E\{|X_1|^3\}$ existing, then, as $n \to \infty$,
the limiting distribution of $n^{-1/2} \sum_1^n X_i$ is normal with mean $n^{1/2}\mu$ and
variance

$$A = \operatorname{var}(X_1) + 2[\operatorname{cov}(X_1, X_2) + \cdots + \operatorname{cov}(X_1, X_{m+1})]. \tag{4.7}$$
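Formula (4.7) is easy to evaluate once the covariances are known. The sketch below uses an assumed example that is not in the text: $X_i = Z_i + Z_{i+1}$ with the $Z_i$ independent with variance 1, so that the sequence is 1-dependent with $\operatorname{var}(X_1) = 2$ and $\operatorname{cov}(X_1, X_2) = 1$. The function names are hypothetical. For comparison it also computes the exact variance of $n^{-1/2} \sum_1^n X_i$ for finite $n$, using the standard expression for a stationary sequence.

```python
from fractions import Fraction

def limiting_variance(autocov):
    """A of (4.7); autocov[j] = cov(X_1, X_{1+j}) for j = 0, ..., m."""
    return autocov[0] + 2 * sum(autocov[1:])

def exact_scaled_variance(autocov, n):
    """Exact variance of n^{-1/2} (X_1 + ... + X_n) for a stationary sequence:
    gamma_0 + 2 * sum over j >= 1 of (1 - j/n) * gamma_j."""
    return autocov[0] + 2 * sum((1 - Fraction(j, n)) * g
                                for j, g in enumerate(autocov) if 0 < j < n)

autocov = [2, 1]   # assumed example: X_i = Z_i + Z_{i+1}, var(Z_i) = 1
print(limiting_variance(autocov))
for n in (10, 100, 1000):
    print(n, exact_scaled_variance(autocov, n))
```

In this example the exact variance is $4 - 2/n$, which approaches $A = 4$.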


There are also multivariate extensions of these theorems and the method 
of proof is essentially that used above. We quote the bivariate form of the 
theorem for a stationary sequence. 


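THEOREM 4.3. If $(X_1, Y_1), (X_2, Y_2), \ldots$ is a stationary $m$-dependent
sequence of random variables over $R^2$ which has $E\{X_1\} = 0$, $E\{Y_1\} = 0$
and $E\{|X_1|^3\}$, $E\{|Y_1|^3\}$ existing, then as $n \to \infty$, the limiting distribution
of

$$\Big(n^{-1/2} \sum_{1}^{n} X_i,\ n^{-1/2} \sum_{1}^{n} Y_i\Big)$$

is normal with mean 0 and covariance matrix

$$\begin{pmatrix} A & B \\ B & C \end{pmatrix}$$

where

$$A = \operatorname{var}\{X_1\} + 2 \sum_{j=1}^{m} \operatorname{cov}\{X_1, X_{1+j}\},$$
$$B = \operatorname{cov}\{X_1, Y_1\} + \sum_{j=1}^{m} [\operatorname{cov}\{X_1, Y_{1+j}\} + \operatorname{cov}\{X_{1+j}, Y_1\}], \tag{4.8}$$
$$C = \operatorname{var}\{Y_1\} + 2 \sum_{j=1}^{m} \operatorname{cov}\{Y_1, Y_{1+j}\}.$$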
Example 4.1. Consider a sequence $Z_1, Z_2, \ldots$ of independent and
identically distributed random variables having $E\{Z_1\} = 0$, $\operatorname{var}\{Z_1\} = 1$,
and $E\{|Z_1|^3\}$ finite. We can define a 1-dependent sequence $X_1, X_2, \ldots$
by the equation $X_i = Z_i Z_{i+1}$. Obviously this sequence is stationary, and
we have $E\{X_1\} = 0$, $\operatorname{var}\{X_1\} = 1$, and $E\{|X_1|^3\}$ finite. We calculate $A$:

$$A = E\{X_1^2\} + 2E\{X_1 X_2\} = E\{Z_1^2 Z_2^2\} + 2E\{Z_1 Z_2^2 Z_3\} = 1.$$

By Theorem 4.2 the random variable

$$n^{-1/2}(Z_1 Z_2 + Z_2 Z_3 + \cdots + Z_n Z_{n+1})$$

has a limiting normal distribution with mean 0 and variance 1.

If we assume further that $E\{Z_1^6\}$ is finite, then the bivariate sequence
$(Z_1 Z_2, Z_1^2 - 1), (Z_2 Z_3, Z_2^2 - 1), \ldots$ satisfies the conditions of Theorem
4.3, and

$$\Big(n^{-1/2} \sum_{1}^{n} Z_i Z_{i+1},\ n^{-1/2} \sum_{1}^{n} (Z_i^2 - 1)\Big)$$

has a bivariate normal limiting distribution. From this limiting distribution
we can derive the limiting distribution of

$$n^{1/2}\, \frac{Z_1 Z_2 + \cdots + Z_n Z_{n+1}}{Z_1^2 + \cdots + Z_n^2}$$

by applying Theorem 2.7. Let $d_n = n^{-1/2}$ and $h(x, y) = x/(y + 1)$. Then
$h_1 = 1$, $h_2 = 0$, and $h(0, 0) = 0$. As $n \to \infty$, the limiting distribution of

$$\frac{n^{-1/2} \sum_{1}^{n} Z_i Z_{i+1}}{1 + n^{-1} \sum_{1}^{n} (Z_i^2 - 1)}$$

is the same as the limiting distribution of

$$h_1 n^{-1/2} \sum_{1}^{n} Z_i Z_{i+1} + h_2 n^{-1/2} \sum_{1}^{n} (Z_i^2 - 1) = n^{-1/2} \sum_{1}^{n} Z_i Z_{i+1},$$

which is the normal distribution with mean 0 and variance 1.
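The conclusion of Example 4.1 can be checked roughly by simulation. The sketch below assumes unit normal $Z_i$ (one distribution satisfying the moment conditions of the example); the seed, sample sizes and function names are arbitrary illustrative choices.

```python
import math
import random

def ratio_statistic(z):
    """n^{1/2} (Z_1 Z_2 + ... + Z_n Z_{n+1}) / (Z_1^2 + ... + Z_n^2),
    where z holds the n + 1 values Z_1, ..., Z_{n+1}."""
    n = len(z) - 1
    numerator = sum(z[i] * z[i + 1] for i in range(n))
    denominator = sum(z[i] ** 2 for i in range(n))
    return math.sqrt(n) * numerator / denominator

def simulate(reps=2000, n=400, seed=2024):
    """Simulated values of the ratio statistic for unit normal Z's (an assumption)."""
    rng = random.Random(seed)
    return [ratio_statistic([rng.gauss(0.0, 1.0) for _ in range(n + 1)])
            for _ in range(reps)]

def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v

print(mean_and_variance(simulate()))
```

The sample mean and variance should be near 0 and 1 respectively, up to simulation error.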

Problem 5 is to show that a serial correlation coefficient has a limiting
normal distribution, and it illustrates the Hoeffding and Robbins theorem
used in conjunction with a trivariate extension of Theorem 2.7.

$m$-dependent sequences of random variables often arise in the following
manner. Let $Z_1, Z_2, \ldots$ be a sequence of independent and identically
distributed random variables. If $f_i(z_1, \ldots, z_k)$ is a real-valued statistic
defined for $i = 1, 2, \ldots$, then the sequence $f_1(Z_1, \ldots, Z_k), f_2(Z_2, \ldots, Z_{k+1}), \ldots$
is obviously $(k - 1)$-dependent. The Hoeffding and Robbins theorem
gives conditions under which the sum

$$\sum_{i=1}^{n} f_i(Z_i, \ldots, Z_{i+k-1})$$

has a limiting normal distribution. However, for some nonparametric
applications we are interested in knowing the conditional distribution of
the sum

$$\sum_{i=1}^{n} f_i(Z_i, \ldots, Z_{i+k-1})$$

given the order statistic for the first $n$ $z$'s. This is the distribution of the
sum under the $n!$ permutations of a given sequence of $z$'s, $z_1, \ldots, z_n$.
For this, we slightly alter the definition of the sum to put it in a 'circular'
form, but the results derived apply equally to the form above. Any $z$
having index greater than $n$, $z_{n+j}$, is taken to be $z_j$. A statistic of this form
has been called a serial statistic by Ghosh.

To complete this section we quote a theorem by Ghosh. This theorem
needs stronger assumptions than the Hoeffding and Robbins theorem and
proves that, as $n \to \infty$, the distribution function of a serial statistic under
equally likely permutations of the $z_i$ converges in probability to a normal
distribution function.

For this let $Z_1, Z_2, \ldots$ be a sequence of independent and identically
distributed random variables, and let $f_1(z_1, \ldots, z_k), f_2(z_1, \ldots, z_k), \ldots$ be a
sequence of real-valued statistics such that

$$E\{|f_i(Z_1, \ldots, Z_k)|^s\} < l_s \tag{4.9}$$

holds for all $i = 1, 2, \ldots$ and for each $s = 1, \ldots$. Ghosh defines a
serial statistic, $S(z_1, \ldots, z_n)$, by the equation

$$S(z_1, \ldots, z_n) = \frac{1}{n} \sum_{i=1}^{n} f_i(z_i, \ldots, z_{i+k-1}), \tag{4.10}$$

where, for this sum, $z_{n+j}$ $(j > 0)$ is taken to be $z_j$, and we have the 'circular'
definition. We consider the distribution of $S(z_1, \ldots, z_n)$ under equally
likely permutations of the set $(z_1, \ldots, z_n)$. This is, of course, the conditional
distribution of $S(Z_1, \ldots, Z_n)$, given that the order statistic
$\{Z_1, \ldots, Z_n\}$ takes the value $\{z_1, \ldots, z_n\}$. Expectations under this conditional
distribution we designate by $E'$. Then for the mean and variance
of $S(z_1, \ldots, z_n)$ we have

$$M_1 = E'\{S(Z_1, \ldots, Z_n)\} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{n(n-1) \cdots (n-k+1)} \sum_{P} f_i(z_{j_1}, \ldots, z_{j_k}),$$

$$M_2 = E'\{(S(Z_1, \ldots, Z_n) - M_1)^2\},$$

where $\sum_P$ denotes summation over all permutations $(j_1, \ldots, j_k)$ of $k$ integers
selected from $(1, \ldots, n)$.
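For small $n$ the permutation mean $M_1$ can be verified by brute force. The following sketch makes simplifying assumptions that are not in the text: the same function $f$ is used for every $i$, and the helper names are hypothetical. It enumerates all $n!$ permutations of a given set of $z$'s, averages the circular serial statistic (4.10), and compares the result with the closed form for $M_1$ given above.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def serial_statistic(z, f, k):
    """Circular serial statistic (1/n) * sum_i f(z_i, ..., z_{i+k-1}), with
    z_{n+j} taken to be z_j; the same f is used for every i (an assumption)."""
    n = len(z)
    total = sum(f(*(z[(i + j) % n] for j in range(k))) for i in range(n))
    return Fraction(total, n)

def permutation_mean_by_enumeration(z, f, k):
    """Average of the serial statistic over all n! orderings of z."""
    total = sum(serial_statistic(p, f, k) for p in permutations(z))
    return total / factorial(len(z))

def permutation_mean_by_formula(z, f, k):
    """M_1: average of f over all ordered selections of k distinct indices."""
    n = len(z)
    count = factorial(n) // factorial(n - k)   # n (n-1) ... (n-k+1)
    total = sum(f(*(z[j] for j in idx)) for idx in permutations(range(n), k))
    return Fraction(total, count)

z = (1, 2, 4, 7, 11)
product = lambda a, b: a * b
print(permutation_mean_by_enumeration(z, product, 2),
      permutation_mean_by_formula(z, product, 2))
```

Enumeration is feasible only for very small $n$; it is used here purely as a check on the formula.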

Also for the statement of the theorem we need to define two symbols
$\mu_{1n}$ and $\mu_{2n}$. $M_1$ and $M_2$ are functions of $(z_1, \ldots, z_n)$ and hence have a
probability distribution corresponding to the random variable $(Z_1, \ldots, Z_n)$.
$\mu_{1n}$ and $\mu_{2n}$ turn out to be related in a probability sense to $M_1$ and $nM_2$
respectively.


$\mu_{1n} = \frac{1}{n} \sum_{i=1}^{n} E\{f_i(Z_i, \cdots, Z_{i+k-1})\},$


i 
Han = A > E{f(Z;, 9 Zinn DIZ > Ziad} 


|i-j|<k 


= > = Et fZ, 0 Zd SAZ "s Zax) 


li—j| <k 


sa 15 i Z Zaa s Z 


ij=L a 


1 2 
ae f Senz oo zo| ; 
where again $Z_{n+j} = Z_j$ ($j > 0$), and where $\sigma_\beta = \alpha$ ($\alpha, \beta \le k$), and
$\sigma_\gamma$ ($\gamma \ne \beta$) is greater than $k$. This last statement means that the suffix $\sigma_\beta$
takes the value $\alpha$ and all other $\sigma$ suffixes in the expression take values
greater than $k$.

THEOREM 4.4. (GHOSH). If $Z_1, Z_2, \cdots$ is a sequence of independent
random variables, each having the same continuous distribution function,
if (4.9) holds and if $\liminf_{n \to \infty} \mu_{2n} > 0$, then, as $n \to \infty$, the probability
approaches one that the permutation distribution function of

$\mu_{2n}^{-1/2} \, n^{1/2} \left[ \frac{1}{n} \sum_{i=1}^{n} f_i(z_i, \cdots, z_{i+k-1}) - M_1 \right]$

differs by less than any preassigned amount from the normal distribution
function with mean 0 and variance 1. Also $M_1 - \mu_{1n}$ and $nM_2 - \mu_{2n}$
converge in probability to zero as $n \to \infty$.


Proof. See Ghosh [12]. 
Note. The permutation distribution function for

$M_2^{-1/2} \left[ \frac{1}{n} \sum_{i=1}^{n} f_i(z_i, \cdots, z_{i+k-1}) - M_1 \right]$

depends on the order statistic for $z_1, \cdots, z_n$ and hence is a random
variable in terms of $Z_1, \cdots, Z_n$. This random distribution function
converges in probability to the normal distribution function with mean 0
and variance 1.

Ghosh [12] has an extension of Theorem 4.4 to cover the case of vector
functions $f_i(z_1, \cdots, z_k)$. Also he considers the distribution of $S(Z_1, \cdots, Z_n)$
when the $Z$'s are not independent but have the distribution of a Markov
process.


5. THE LIMITING DISTRIBUTION OF U STATISTICS


In Section 2 of Chapter 4 we defined U statistics and referred to a
theorem by Hoeffding which gave the limiting distribution of U statistics.
In this section we prove Hoeffding's theorem.

Let $\mathcal{X}(\mathcal{A})$ be a measurable space, and consider the sample space
$\mathcal{X}^n$. Corresponding to any statistic $f(x_1, \cdots, x_m)$ defined over
$\mathcal{X}^m$ ($m \le n$), we define a U statistic $U(x_1, \cdots, x_n)$ over $\mathcal{X}^n$,

(5.1)  $U(x_1, \cdots, x_n) = \frac{1}{n(n-1)\cdots(n-m+1)} \sum_{P_m} f(x_{\alpha_1}, \cdots, x_{\alpha_m}),$

where $P_m$ indicates that the summation is over all permutations $(\alpha_1, \cdots, \alpha_m)$
of $m$ integers chosen from $(1, \cdots, n)$. As proved in Chapter 4, we can
always write the U statistic in terms of a symmetric statistic $f^*(x_1, \cdots, x_m)$,

(5.2)  $U(x_1, \cdots, x_n) = \binom{n}{m}^{-1} \sum_{C_m} f^*(x_{\alpha_1}, \cdots, x_{\alpha_m}),$

where $C_m$ indicates that the summation is over all combinations $(\alpha_1, \cdots, \alpha_m)$
of $m$ integers chosen from $(1, \cdots, n)$ and where

$f^*(x_1, \cdots, x_m) = \frac{1}{m!} \sum_{P_m} f(x_{\alpha_1}, \cdots, x_{\alpha_m}),$

the summation here being over the $m!$ permutations $(\alpha_1, \cdots, \alpha_m)$ of $(1, \cdots, m)$.
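
The combination form (5.2) translates directly into code. The following is a minimal Python sketch; the names `u_statistic` and `variance_kernel` are illustrative assumptions, and the kernel $\frac{1}{2}(x_1 - x_2)^2$ is chosen only because its U statistic is the familiar unbiased sample variance.

```python
from itertools import combinations
from math import comb


def u_statistic(xs, kernel, m):
    """Average a symmetric kernel of m arguments over all C(n, m) subsets,
    as in the combination form of a U statistic."""
    n = len(xs)
    total = sum(kernel(*(xs[i] for i in idx))
                for idx in combinations(range(n), m))
    return total / comb(n, m)


def variance_kernel(x1, x2):
    """Symmetric kernel of degree 2 whose expectation is the variance."""
    return 0.5 * (x1 - x2) ** 2
```

With this kernel, `u_statistic(xs, variance_kernel, 2)` agrees with the sample variance computed with divisor $n - 1$. The enumeration costs $\binom{n}{m}$ kernel evaluations, so the sketch is intended for small samples only.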


Let $X_1, \cdots, X_n$ be $n$ independent random variables having the same
probability measure $P$ over $\mathcal{X}(\mathcal{A})$. We first consider the variance of a
U statistic. Assume that $E\{(f^*(X_1, \cdots, X_m))^2\}$ exists. Then, of course,
the first moment also exists, and we let

(5.3)  $E\{f^*(X_1, \cdots, X_m)\} = \eta.$

From the theory of Chapter 4 it follows then that

$E\{U(X_1, \cdots, X_n)\} = \eta.$

We define a function $f_c^*(x_1, \cdots, x_c)$ by taking the conditional expectation of
$f^*(X_1, \cdots, X_m)$, given $x_1, \cdots, x_c$. Let

(5.4)  $f_c^*(x_1, \cdots, x_c) = E\{f^*(x_1, \cdots, x_c, X_{c+1}, \cdots, X_m)\}$

for $c = 1, \cdots, m$. To obtain a simple expression for the variance of $U$,
we need the variances of these functions $f_c^*$; let

$\zeta_0 = 0,$
$\zeta_c = \mathrm{var}\{f_c^*(X_1, \cdots, X_c)\}$  ($c = 1, \cdots, m$).

The variance $\zeta_m$ is the variance of $f^*(X_1, \cdots, X_m)$ and exists by reason of
our assumption above. By Theorem 2.4 in Chapter 2 it can be shown that
the other variances satisfy $\zeta_c \le \zeta_m$, and hence are finite (Problem 7).

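Let $\alpha_1, \cdots, \alpha_m$ and $\beta_1, \cdots, \beta_m$ be two sets of $m$ different integers chosen
from $1, \cdots, n$, and let $c$ be the number of integers common to the two sets.
Then, using the symmetry of the function $f^*(x_1, \cdots, x_m)$, we find the
covariance of $f^*(X_{\alpha_1}, \cdots, X_{\alpha_m})$ and $f^*(X_{\beta_1}, \cdots, X_{\beta_m})$:

(5.5)  $\mathrm{cov}\{f^*(X_{\alpha_1}, \cdots, X_{\alpha_m}), f^*(X_{\beta_1}, \cdots, X_{\beta_m})\}$
       $= E\{[f^*(X_1, \cdots, X_c, X_{c+1}, \cdots, X_m) - \eta][f^*(X_1, \cdots, X_c, X_{m+1}, \cdots, X_{2m-c}) - \eta]\}$
       $= E\{[f_c^*(X_1, \cdots, X_c) - \eta][f_c^*(X_1, \cdots, X_c) - \eta]\}$
       $= \zeta_c.$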
Then, for the variance of $U$, we have

$\mathrm{var}\{U(X_1, \cdots, X_n)\} = \binom{n}{m}^{-2} \sum \mathrm{cov}\{f^*(X_{\alpha_1}, \cdots, X_{\alpha_m}), f^*(X_{\beta_1}, \cdots, X_{\beta_m})\},$

where the summation is over all combinations $(\alpha_1, \cdots, \alpha_m)$ of $m$ integers
from $(1, \cdots, n)$ and all combinations $(\beta_1, \cdots, \beta_m)$ of $m$ integers from
$(1, \cdots, n)$. We have

(5.6)  $\mathrm{var}\{U(X_1, \cdots, X_n)\} = \binom{n}{m}^{-2} \sum_{c=0}^{m} \binom{n}{m}\binom{m}{c}\binom{n-m}{m-c} \zeta_c = \binom{n}{m}^{-1} \sum_{c=1}^{m} \binom{m}{c}\binom{n-m}{m-c} \zeta_c,$

where $\binom{n}{m}\binom{m}{c}\binom{n-m}{m-c}$ gives the number of pairs of sets $(\alpha_1, \cdots, \alpha_m)$,
$(\beta_1, \cdots, \beta_m)$ having exactly $c$ common integers.

Using the notation introduced above, we now give Hoeffding's theorem.

THEOREM 5.1. (HOEFFDING). If $X_1, \cdots, X_n$ are independent random
variables having the same distribution over $\mathcal{X}(\mathcal{A})$, and if $f^*(x_1, \cdots, x_m)$ is
a real-valued symmetric statistic over $\mathcal{X}^m$ and has expectation $\eta$ and
finite second moment $E\{[f^*(X_1, \cdots, X_m)]^2\} < \infty$, then as $n \to \infty$, the
limiting distribution of $n^{1/2}[U(X_1, \cdots, X_n) - \eta]$, where

(5.7)  $U(x_1, \cdots, x_n) = \binom{n}{m}^{-1} \sum_{C_m} f^*(x_{\alpha_1}, \cdots, x_{\alpha_m}),$

is normal with mean 0 and variance $m^2\zeta_1$ [$\zeta_1$ is defined by (5.5)].


Proof. By the introduction to this theorem we know that
$E\{[f^*(X_1, \cdots, X_m)]^2\} < \infty$ implies that $\zeta_1, \cdots, \zeta_m$ exist finite.
Our method of proof is to show that the random variable
$Z_n = n^{1/2}[U(X_1, \cdots, X_n) - \eta]$ is asymptotically equivalent to the random
variable $Y_n$ defined by

(5.8)  $Y_n = m n^{-1/2} \sum_{\alpha=1}^{n} [f_1^*(X_\alpha) - \eta].$

First we show that $Y_n$ has asymptotically the normal distribution with
mean 0 and variance $m^2\zeta_1$. By our introduction we know $E\{f_1^*(X_\alpha)\} = \eta$
and $\mathrm{var}\{f_1^*(X_\alpha)\} = \zeta_1$. Also $f_1^*(X_1), f_1^*(X_2), \cdots$ is a sequence of real-valued,
independent, and identically distributed random variables. By
the central-limit Theorem 3.3, $Y_n$ is asymptotically normal with mean 0
and variance $m^2\zeta_1$.

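Now we show that p-lim$(Y_n - Z_n) = 0$. For this it suffices to show
that $E\{(Y_n - Z_n)^2\} \to 0$. (See Problem 3 and remarks on p-lim in Section 2.)

(5.9)  $E\{(Y_n - Z_n)^2\} = E\{Y_n^2\} + E\{Z_n^2\} - 2E\{Y_n Z_n\}.$

By our results above we have

(5.10)  $E\{Y_n^2\} = m^2\zeta_1.$

From (5.6) we obtain

$\mathrm{var}\{U(X_1, \cdots, X_n)\} = \binom{n}{m}^{-1} \left[ \binom{m}{1}\binom{n-m}{m-1}\zeta_1 + \binom{m}{2}\binom{n-m}{m-2}\zeta_2 + \cdots + \zeta_m \right] = \frac{m^2}{n}\zeta_1 + O\!\left(\frac{1}{n^2}\right);$

therefore

(5.11)  $E\{Z_n^2\} = n \, \mathrm{var}\{U(X_1, \cdots, X_n)\} = m^2\zeta_1 + O\!\left(\frac{1}{n}\right).$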


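We evaluate $E\{Y_n Z_n\}$.

$E\{Y_n Z_n\} = m \, E\left\{ [U(X_1, \cdots, X_n) - \eta] \sum_{\alpha=1}^{n} [f_1^*(X_\alpha) - \eta] \right\}$

$= m \binom{n}{m}^{-1} E\left\{ \sum_{C_m} [f^*(X_{\alpha_1}, \cdots, X_{\alpha_m}) - \eta] \sum_{\alpha=1}^{n} [f_1^*(X_\alpha) - \eta] \right\}$

$= m \binom{n}{m}^{-1} \sum_{C_m} \sum_{\alpha=1}^{n} E\{[f^*(X_{\alpha_1}, \cdots, X_{\alpha_m}) - \eta][f_1^*(X_\alpha) - \eta]\}.$

The expectation in the expression above is zero if $\alpha$ is not equal to any $\alpha_i$ and
is $\zeta_1$ if $\alpha$ is equal to one of the $\alpha_i$. For a fixed $\alpha$ the number of sets
$\{\alpha_1, \cdots, \alpha_m\}$ containing $\alpha$ is $\binom{n-1}{m-1}$. Therefore,

(5.12)  $E\{Y_n Z_n\} = m \binom{n}{m}^{-1} n \binom{n-1}{m-1} \zeta_1 = m^2\zeta_1.$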


Equations (5.10), (5.11), (5.12) imply that the expression (5.9) has limit
zero. Therefore p-lim$_{n \to \infty} (Y_n - Z_n) = 0$.

We have $Z_n = Y_n + (Z_n - Y_n)$. By Theorem 2.6 and our
results above, it follows that $Z_n$ is asymptotically normal with mean 0 and
variance $m^2\zeta_1$. This completes the proof.

Hoeffding also proves some inequalities among var $\{U\}$ and the $\zeta$'s.
These can be useful for the application of the theorem, and we quote them
without proof.

THEOREM 5.2. The variances $\zeta_1, \zeta_2, \cdots, \zeta_m$ defined by (5.5) satisfy the
inequalities

(5.13)  $\frac{\zeta_c}{c} \le \frac{\zeta_d}{d},$

where $1 \le c < d \le m$.


THEOREM 5.3. If $X_1, \cdots, X_n$ are independent and identically distributed,
then the variance of a U statistic satisfies

(5.14)  $\frac{m^2}{n}\zeta_1 \le \mathrm{var}\{U_n\} \le \frac{m}{n}\zeta_m;$

$n \, \mathrm{var}\{U_n\}$ is a decreasing function of $n$,

(5.15)  $(n + 1) \, \mathrm{var}\{U_{n+1}\} \le n \, \mathrm{var}\{U_n\},$

which takes its upper bound $m\zeta_m$ for $n = m$ and tends to its lower bound
$m^2\zeta_1$ as $n$ increases:

(5.16)  $\lim_{n \to \infty} n \, \mathrm{var}\{U_n\} = m^2\zeta_1.$
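
The variance formula (5.6) and the bounds of Theorem 5.3 can be checked numerically. Below is a small Python sketch; the function name `u_statistic_variance` is an assumption for illustration, and the values $\zeta_1 = 1/9$, $\zeta_2 = 1$ used in the usage note are simply one admissible pair for a kernel of degree 2.

```python
from math import comb


def u_statistic_variance(n, zetas):
    """Evaluate var{U} = C(n,m)^(-1) * sum_{c=1..m} C(m,c) C(n-m,m-c) zeta_c.

    zetas[c - 1] holds zeta_c, so the kernel degree m is len(zetas).
    """
    m = len(zetas)
    total = sum(comb(m, c) * comb(n - m, m - c) * zetas[c - 1]
                for c in range(1, m + 1))
    return total / comb(n, m)
```

For example, with `zetas = [1/9, 1.0]` the product `n * u_statistic_variance(n, zetas)` starts at $m\zeta_m = 2$ when $n = m = 2$ and decreases toward $m^2\zeta_1 = 4/9$ as $n$ grows, in line with (5.14) to (5.16).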


Using this last theorem with Theorem 2.6 we obtain immediately

THEOREM 5.4. If the conditions of Theorem 5.1 are fulfilled and if
$\zeta_1 > 0$, then the limiting distribution of

$\frac{U - \eta}{\sigma_U}$

is normal with mean 0 and variance 1.

In [5] Hoeffding generalizes the results above to obtain the limiting
distribution of a vector U statistic. We indicate the form of these results.
Let $U = (U^{(1)}, \cdots, U^{(r)})$ be based on the symmetric statistics
$f^{*(1)}(x_1, \cdots, x_{m_1}), \cdots, f^{*(r)}(x_1, \cdots, x_{m_r})$. If $\zeta_1^{(i,j)}$ is the covariance
between $f_1^{*(i)}(X)$ and $f_1^{*(j)}(X)$, using the previous notation, then $n^{1/2}(U - \eta)$
has a limiting multivariate normal distribution with means 0 and covariance
matrix $\| m_i m_j \zeta_1^{(i,j)} \|$ (assuming, of course, that second moments
of the $f^{*(i)}(X_1, \cdots, X_{m_i})$ exist). To derive the covariance of two U
statistics is given as Problem 10.

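Also Hoeffding treats the case of $X$'s not identically distributed and
obtains for U statistics a generalization which corresponds to the Liapounoff
form of the central-limit theorem. We quote this second generalization
after developing some necessary notation.

Let $X_1, \cdots, X_n$ be independent random variables over $\mathcal{X}(\mathcal{A})$, and
suppose they do not necessarily have the same probability distributions.
Corresponding to a U statistic,

(5.17)  $U(x_1, \cdots, x_n) = \binom{n}{m}^{-1} \sum_{C_m} f^*(x_{\alpha_1}, \cdots, x_{\alpha_m}),$

with $f^*$ symmetric, we define

$\eta_{\alpha_1 \cdots \alpha_m} = E\{f^*(X_{\alpha_1}, \cdots, X_{\alpha_m})\},$

$f^*_{c(\alpha_1 \cdots \alpha_c)\beta_1 \cdots \beta_{m-c}}(x_1, \cdots, x_c) = E\{f^*(x_1, \cdots, x_c, X_{\beta_1}, \cdots, X_{\beta_{m-c}})\},$

$\zeta_{c(\alpha_1 \cdots \alpha_c)\beta_1 \cdots \beta_{m-c};\gamma_1 \cdots \gamma_{m-c}} = \mathrm{cov}\{f^*_{c(\alpha_1 \cdots \alpha_c)\beta_1 \cdots \beta_{m-c}}(X_{\alpha_1}, \cdots, X_{\alpha_c}), f^*_{c(\alpha_1 \cdots \alpha_c)\gamma_1 \cdots \gamma_{m-c}}(X_{\alpha_1}, \cdots, X_{\alpha_c})\},$

$\zeta_{c,n} = \frac{c!(m-c)!(m-c)!}{n(n-1)\cdots(n-2m+c+1)} \sum \zeta_{c(\alpha_1 \cdots \alpha_c)\beta_1 \cdots \beta_{m-c};\gamma_1 \cdots \gamma_{m-c}},$

where the sum is extended over all disjoint sets $\{\alpha_1, \cdots, \alpha_c\}$, $\{\beta_1, \cdots, \beta_{m-c}\}$,
$\{\gamma_1, \cdots, \gamma_{m-c}\}$ chosen from $(1, \cdots, n)$. Then it is straightforward to show
that

(5.18)  $\mathrm{var}\{U\} = \binom{n}{m}^{-1} \sum_{c=1}^{m} \binom{m}{c}\binom{n-m}{m-c} \zeta_{c,n}.$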
We also define a function $g_{1(\nu)}(x)$,

(5.19)  $g_{1(\nu)}(x) = \binom{n-1}{m-1}^{-1} \sum [f^*_{1(\nu)\beta_1 \cdots \beta_{m-1}}(x) - \eta_{\nu\beta_1 \cdots \beta_{m-1}}],$

where the summation is over all sets $(\beta_1, \cdots, \beta_{m-1})$ chosen from the first
$n$ integers excluding the integer $\nu$. We now quote Hoeffding's theorem for
the case of $X$'s not necessarily identically distributed.


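THEOREM 5.5. If $X_1, \cdots, X_n$ are $n$ independent random variables, if

$E\{f^{*2}(X_{\alpha_1}, \cdots, X_{\alpha_m})\} < A$

for all $\alpha_1, \cdots, \alpha_m$, if

$E\{|g_{1(\nu)}(X_\nu)|^3\} < \infty$

for $\nu = 1, \cdots, n$ and if

$\lim_{n \to \infty} \frac{\sum_{\nu=1}^{n} E\{|g_{1(\nu)}(X_\nu)|^3\}}{\left[\sum_{\nu=1}^{n} E\{g_{1(\nu)}^2(X_\nu)\}\right]^{3/2}} = 0,$

then, as $n \to \infty$, the limiting distribution of

$\frac{U - E\{U\}}{[\mathrm{var}\{U\}]^{1/2}}$

is normal with mean 0 and variance 1.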

Another extension of Hoeffding's theorem has been proposed by
E. L. Lehmann. Problem 8 suggests a method of proof following the
pattern of proof for Theorem 5.1. Let $X_1, \cdots, X_{n_1}$ be independent and
have the same distribution over $\mathcal{X}(\mathcal{A})$, and let $Y_1, \cdots, Y_{n_2}$ be independent
and have the same distribution over $\mathcal{Y}(\mathcal{B})$. Then we have

THEOREM 5.6. If $f^*(x_1, \cdots, x_{m_1}; y_1, \cdots, y_{m_2})$ is a real-valued statistic
symmetric in the $x$'s, symmetric in the $y$'s, with expectation $\eta$ and with
finite second moment, and if $n_1 \le n_2$, and $n_1 \to \infty$ such that $\lim n_1/n_2$ exists,
then $n_1^{1/2}(U - \eta)$, where $U$ is given by

(5.20)  $U = \left[\binom{n_1}{m_1}\binom{n_2}{m_2}\right]^{-1} \sum f^*(x_{\alpha_1}, \cdots, x_{\alpha_{m_1}}; y_{\beta_1}, \cdots, y_{\beta_{m_2}})$

and the summation is over all combinations $(\alpha_1, \cdots, \alpha_{m_1})$ from $(1, \cdots, n_1)$
and all combinations $(\beta_1, \cdots, \beta_{m_2})$ from $(1, \cdots, n_2)$, has a limiting normal
distribution with mean 0.

Note. The variance of the limiting distribution together with suggestions
for a proof are given in Problem 8.


EXAMPLE 5.1. In Chapter 4 we considered the estimation of moments
and cumulants by means of U statistics. By Hoeffding's theorem we now
know that, if the second moment of a symmetric kernel exists, then

$n^{1/2}(U - E(U))$

has a limiting normal distribution. Actually, as is easily seen, it is
sufficient to have a finite second moment for any kernel. By Theorem 2.7
we can then infer that many functions of moments and k statistics have a
limiting normal distribution.

Also in Chapter 4 we defined Gini's mean difference for $n$ real variables,
$y_1, \cdots, y_n$:

(5.21)  $d = \frac{1}{n(n-1)} \sum_{\alpha \ne \beta} |y_\alpha - y_\beta|.$

If $Y_1, \cdots, Y_n$ are independent and identically distributed random variables
having distribution function $F(x)$, then the mean of the induced distribution
of $d$ is

(5.22)  $\Delta = \int\!\!\int |y_1 - y_2| \, dF(y_1) \, dF(y_2),$

and the variance by (5.6) is

$\mathrm{var}\{d\} = \frac{2}{n(n-1)} [2(n-2)\zeta_1 + \zeta_2],$

where

$\zeta_1 = \int \left[ \int |y_1 - y_2| \, dF(y_2) \right]^2 dF(y_1) - \Delta^2,$

(5.23)  $\zeta_2 = \int\!\!\int (y_1 - y_2)^2 \, dF(y_1) \, dF(y_2) - \Delta^2 = 2 \, \mathrm{var}\{Y\} - \Delta^2.$

Values of var $\{d\}$ for several distributions have been tabulated by U. S. Nair
[18]. By Hoeffding's theorem, $\sqrt{n}(d - \Delta)$ has a limiting normal
distribution provided the variance exists.
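
As an illustration of (5.21) and the variance expression that follows it, here is a brief Python sketch. The function names are illustrative assumptions, and the values of $\zeta_1$ and $\zeta_2$ must be supplied by the user for the distribution at hand; the code does not estimate them.

```python
def gini_mean_difference(ys):
    """Gini's mean difference: average of |y_a - y_b| over ordered pairs a != b."""
    n = len(ys)
    total = sum(abs(a - b)
                for i, a in enumerate(ys)
                for j, b in enumerate(ys) if i != j)
    return total / (n * (n - 1))


def gini_variance(n, zeta1, zeta2):
    """var{d} = 2 / (n (n - 1)) * [2 (n - 2) zeta1 + zeta2] for a degree-2 kernel."""
    return 2.0 / (n * (n - 1)) * (2 * (n - 2) * zeta1 + zeta2)
```

For the three values 1, 2, 4 the absolute differences are 1, 3 and 2, each counted twice, so `gini_mean_difference([1, 2, 4])` averages them over six ordered pairs.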
EXAMPLE 5.2. Some interesting applications of Hoeffding's theorem are
obtained for statistics based on ranks. First we introduce some notation
which simplifies the use of ranks. The sign function $s(x)$ is given by

(5.24)  $s(x) = -1$ if $x < 0$
             $= 0$ if $x = 0$
             $= 1$ if $x > 0$.

Also let

(5.25)  $c(u) = 0$ if $u < 0$
             $= \frac{1}{2}$ if $u = 0$
             $= 1$ if $u > 0$
             $= \frac{1}{2}[1 + s(u)].$

For a sequence of numbers, $x_1, \cdots, x_n$, the rank $r_\alpha$ of the $\alpha$-th number $x_\alpha$ is
of course one more than the number of smaller $x$'s; therefore

(5.26)  $r_\alpha = \frac{1}{2} + \sum_{\beta=1}^{n} c(x_\alpha - x_\beta) = \frac{n+1}{2} + \frac{1}{2} \sum_{\beta=1}^{n} s(x_\alpha - x_\beta).$

If some of the $x$'s are equal, this definition for $r_\alpha$ gives what is known as the
midrank. For a set of equal $x$'s, each has the same rank, the average of
the ranks that would have been assigned had the $x$'s been all different.
From the above it is seen that any function of ranks can be represented as
a function of the $\binom{n}{2}$ signs of differences, and of course any function of
the signs of differences can be represented as a function of the ranks.

For a sequence of vectors, $x_1, \cdots, x_n$, where $x_\alpha = (x_\alpha^{(1)}, \cdots, x_\alpha^{(r)})$, we
can define for each vector coordinate a set of $n$ ranks, $r_1^{(i)}, \cdots, r_n^{(i)}$:
$r_\alpha^{(i)}$ is the rank of $x_\alpha^{(i)}$ in $\{x_1^{(i)}, \cdots, x_n^{(i)}\}$. Consider a kernel $f(x_1, \cdots, x_m)$
of a U statistic, and suppose that $f(x_1, \cdots, x_m)$ is a function only of signs
of differences, $s(x_\alpha^{(i)} - x_\beta^{(i)})$, for $\alpha, \beta = 1, \cdots, m$ and $i = 1, \cdots, r$. The
corresponding U statistic can then be expressed as a function of the ranks,
$r_\alpha^{(i)}$.

Let $X_1, \cdots, X_n$ be independent and each have the same distribution
over $R^r$. The kernel $f(X_1, \cdots, X_m)$ defined above will automatically
satisfy the conditions of Theorem 5.1. For this we need only show
that it has finite variance. From its definition $f(x_1, \cdots, x_m)$ can take on
only a finite number of real values, and therefore the induced distribution
of $f(X_1, \cdots, X_m)$ will have finite second moment (in fact, the induced
distribution will have bounded second moment, regardless of the distribution
for the $X$'s). Then, if $U$ is the U statistic corresponding to $f$, and $\eta$ is
its expected value, we have by Theorem 5.1 that $n^{1/2}(U - \eta)$ has a limiting
normal distribution with mean 0.
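
The representation of ranks through signs of differences in (5.26) can be sketched in Python as follows. The helper names `sign` and `midranks` are illustrative assumptions; the code evaluates the second form of (5.26) directly and so yields midranks when ties are present.

```python
def sign(x):
    """Sign function s(x): -1, 0, or 1."""
    return (x > 0) - (x < 0)


def midranks(xs):
    """Rank of each x as (n + 1)/2 + (1/2) * sum over b of s(x_a - x_b).

    Tied values automatically share the average of the ranks they span.
    """
    n = len(xs)
    return [(n + 1) / 2 + 0.5 * sum(sign(a - b) for b in xs) for a in xs]
```

For the sequence 10, 20, 20, 30 the two tied values would occupy ranks 2 and 3, and the formula assigns each of them the midrank 2.5.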

For the remainder of this example we consider the application of the 
above ideas to the difference sign correlation. The application to the 
rank correlation coefficient, the grade correlation coefficient, the partial- 
difference-sign correlation coefficient, and a statistic used by Mann for 
detecting trend will be considered in Problems 10 to 17. 

Consider the sequence $(x_1^{(1)}, x_1^{(2)}), \cdots, (x_n^{(1)}, x_n^{(2)})$, where the $x_\alpha^{(i)}$ are real.
For each coordinate we can form $n(n-1)$ signs of difference, $s(x_\alpha^{(i)} - x_\beta^{(i)})$,
for $\alpha, \beta = 1, \cdots, n$ ($\alpha \ne \beta$). These $n(n-1)$ numbers satisfy

$\sum_{\alpha \ne \beta} s(x_\alpha^{(i)} - x_\beta^{(i)}) = 0$

for $i = 1, 2$. Hence, if we define $t$ to be the covariance between the
$n(n-1)$ values for the first coordinate and the corresponding $n(n-1)$
values for the second coordinate, we have

(5.27)  $t = \frac{1}{n(n-1)} \sum_{\alpha \ne \beta} s(x_\alpha^{(1)} - x_\beta^{(1)}) \, s(x_\alpha^{(2)} - x_\beta^{(2)}).$

$t$ is called the difference-sign covariance of the $n$ pairs $(x_\alpha^{(1)}, x_\alpha^{(2)})$.
If all the $x^{(1)}$'s and all the $x^{(2)}$'s are different, then

$\sum_{\alpha \ne \beta} s^2(x_\alpha^{(i)} - x_\beta^{(i)}) = n(n-1),$

and $t$ is the product-moment correlation of the difference signs.

$t$ is a U statistic with the symmetric kernel $s(x_1^{(1)} - x_2^{(1)}) \, s(x_1^{(2)} - x_2^{(2)})$.
Let $(X_1^{(1)}, X_1^{(2)}), \cdots, (X_n^{(1)}, X_n^{(2)})$ be independent and identically distributed
over $R^2$ with distribution function $F(x)$. From our theory in Chapter 4 we
know that $t$ is an unbiased estimate of the parameter

(5.28)  $\tau = \int_{R^2}\!\int_{R^2} s(x_1^{(1)} - x_2^{(1)}) \, s(x_1^{(2)} - x_2^{(2)}) \, dF(x_1) \, dF(x_2),$

and for a sufficiently large class of $F$'s it has minimum variance among
unbiased estimates of $\tau$. For $X_1$ and $X_2$ independent with distribution
function $F(x)$, $\tau$ is the covariance of the sign of the difference between the
first coordinates with the sign of the difference between the second
coordinates. If $\tau^+$, $\tau^-$ are the probabilities that the two signs are the same,
different, then of course

(5.29)  $\tau = \tau^+ - \tau^-,$

and, if $F(x)$ is continuous, $\tau^+ + \tau^- = 1$ and $\tau = 2\tau^+ - 1 = 1 - 2\tau^-$.
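Assume now that $F(x)$ is a continuous distribution function. Using the
notation of Theorem 5.1, we have

(5.30)  $f_1^*(x) = E\{s(x^{(1)} - X^{(1)}) \, s(x^{(2)} - X^{(2)})\}$
        $= F(x^{(1)}, x^{(2)}) - [F(x^{(1)}, \infty) - F(x^{(1)}, x^{(2)})]$
        $\quad - [F(\infty, x^{(2)}) - F(x^{(1)}, x^{(2)})]$
        $\quad + [1 - F(x^{(1)}, \infty) - F(\infty, x^{(2)}) + F(x^{(1)}, x^{(2)})]$
        $= 1 - 2F(x^{(1)}, \infty) - 2F(\infty, x^{(2)}) + 4F(x^{(1)}, x^{(2)}).$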
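The variance of $t$ is

(5.31)  $\mathrm{var}\{t\} = \frac{2}{n(n-1)} [2(n-2)\zeta_1 + \zeta_2],$

where

$\zeta_1 = E\{[f_1^*(X)]^2\} - \tau^2,$
$\zeta_2 = E\{s^2(X_1^{(1)} - X_2^{(1)}) \, s^2(X_1^{(2)} - X_2^{(2)})\} - \tau^2 = 1 - \tau^2.$

If $X^{(1)}$ and $X^{(2)}$ are independent and continuous, then the induced
distributions of $F(X^{(1)}, \infty)$ and $F(\infty, X^{(2)})$ are uniform on the interval
$[0, 1]$. Designating by $U_1$, $U_2$ two independent random variables with
this uniform distribution, we have

(5.32)  $\tau = E\{s(X_1^{(1)} - X_2^{(1)})\} \, E\{s(X_1^{(2)} - X_2^{(2)})\} = 0,$

(5.33)  $\zeta_1 = E\{(1 - 2U_1 - 2U_2 + 4U_1U_2)^2\} - 0 = \frac{1}{9},$

(5.34)  $\zeta_2 = 1,$

and hence

(5.35)  $\mathrm{var}\{t\} = \frac{2(2n+5)}{9n(n-1)}.$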
If, however, we have a discontinuous distribution function, then in the case
of independence the var $\{t\}$ will depend on the probabilities at the
discontinuities.

Theorem 5.1 says that $n^{1/2}(t - \tau)$ has a limiting normal distribution
with mean 0. The variance of the limiting distribution is 4/9 when the
distribution function is continuous and the coordinates are independent.
If with probability one $X^{(2)}$ is an increasing function of $X^{(1)}$, then it can be
shown that $\zeta_1 = 0$, and hence $n^{1/2}(t - \tau)$ converges to zero in probability.
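
The variance formula (5.35) can be checked by brute force for small $n$. The Python sketch below is only an illustration under the assumption that, for continuous independent coordinates, every relative ordering of the second coordinate is equally likely; the function names are hypothetical.

```python
from itertools import permutations


def sign(x):
    return (x > 0) - (x < 0)


def difference_sign_covariance(pairs):
    """t = 1/(n(n-1)) * sum over a != b of s(x_a1 - x_b1) * s(x_a2 - x_b2)."""
    n = len(pairs)
    total = 0
    for a in range(n):
        for b in range(n):
            if a != b:
                total += (sign(pairs[a][0] - pairs[b][0])
                          * sign(pairs[a][1] - pairs[b][1]))
    return total / (n * (n - 1))


def null_variance(n):
    """Closed form 2(2n + 5) / (9 n (n - 1)) under independence."""
    return 2 * (2 * n + 5) / (9 * n * (n - 1))


def enumerated_variance(n):
    """Variance of t over all n! equally likely orderings of the second coordinate."""
    values = [difference_sign_covariance(list(zip(range(n), perm)))
              for perm in permutations(range(n))]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

Because the enumeration visits all $n!$ orderings, it is practical only for very small $n$; it is meant as a check on the algebra, not as a computational method.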


EXAMPLE 5.3. Consider the problem of independence, Section 2.2 in
Chapter 3. $(X_1^{(1)}, X_1^{(2)}), \cdots, (X_n^{(1)}, X_n^{(2)})$ are independent and have the
same distribution function $F_\theta(x^{(1)}, x^{(2)})$ where $\theta$ indexes the absolutely
continuous distributions over $R^2$. The problem is

Hypothesis: $F_\theta(x^{(1)}, x^{(2)}) = F_\theta^{(1)}(x^{(1)}) F_\theta^{(2)}(x^{(2)})$ for all $(x^{(1)}, x^{(2)})$, $\theta \in \Omega$,

Alternative: $F_\theta(x^{(1)}, x^{(2)}) \ne F_\theta^{(1)}(x^{(1)}) F_\theta^{(2)}(x^{(2)})$ for some $(x^{(1)}, x^{(2)})$, $\theta \in \Omega$.

According to the theory at the end of Section 3.2 in Chapter 5, any invariant
test function for this problem can be represented by a function of
ranks and hence of difference signs. Also we have the simplicity that any
statistic based on difference signs has a single distribution under the hypothesis.
Two statistics frequently used to form such tests are the difference-sign
correlation $t$ defined in Example 5.2 and the rank correlation $k'$
defined in Problem 10. Theorem 5.1 enables us to choose the constants to
give the test correct size for large $n$ and also enables us to find the power
function for large $n$.

For the difference-sign correlation $t$, we have $E\{t\} = 0$ under the hypothesis.
Hence a natural test is to reject when $|t| > c_n$. For a size-$\alpha$ test
Theorem 5.1 and formula (5.35) show that, for large $n$,

$c_n \approx \frac{2}{3} \frac{z_\alpha}{\sqrt{n}},$

where $[-z_\alpha, z_\alpha]$ is the interval containing probability $1 - \alpha$ for the normal
distribution with mean 0 and variance 1.
For large $n$ the power function is given by

$P_n = \Pr\{|t| > c_n\}.$

Since var $\{t\} = O(n^{-1})$ and $c_n \to 0$, the power approaches one for any
alternative distribution having $\tau \ne 0$; and, if $\tau = 0$, the limit of the power
will be less than one.
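
The large-sample critical value for the difference-sign test can be computed as in the following Python sketch. The function name `critical_value` is an assumption for illustration, and the approximation $c_n \approx \frac{2}{3} z_\alpha / \sqrt{n}$ is used as stated, with $z_\alpha$ taken as the upper $\alpha/2$ point of the standard normal distribution so that $[-z_\alpha, z_\alpha]$ holds probability $1 - \alpha$.

```python
from math import sqrt
from statistics import NormalDist


def critical_value(n, alpha):
    """Approximate c_n such that rejecting when |t| > c_n has size alpha."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal point
    return 2.0 * z_alpha / (3.0 * sqrt(n))
```

For example, `critical_value(100, 0.05)` gives the approximate rejection threshold for $|t|$ with one hundred pairs at the 5 percent level. For small $n$ the exact permutation distribution of $t$ would be preferable to this normal approximation.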

For the rank correlation we have similar results. The test is to reject
if $|k'| > c_n'$, where $c_n'$ is chosen to give the test correct size. Again for any
alternative distribution we can find an approximate value for the power
when $n$ is large. Also if $\kappa \ne 0$, the limiting value of the power as $n$
increases will be one.

Thus the two tests are consistent against alternatives for which, respectively,
$\tau \ne 0$ and $\kappa \ne 0$.


EXAMPLE 5.4. In Example 1.1 we introduced the Mann-Whitney test
for the two-sample problem. Let $X_1, \cdots, X_{n_1}$ be independent with
continuous distribution function $F(x)$, and $X_{n_1+1}, \cdots, X_{n_1+n_2}$ be independent
with continuous distribution function $G(x)$. The Mann-Whitney
test is based on the statistic

$v = \frac{1}{n_1 n_2} \sum_{\alpha=1}^{n_1} \sum_{\beta=n_1+1}^{n_1+n_2} c(x_\beta - x_\alpha),$

where $c(u)$ was defined by (5.25). We consider the application of Theorem
5.6 to obtain the limiting distribution of $v$ as the sample size increases.

Without loss of generality let $n_1 \le n_2$. $v$ is a symmetric function of
$x_1, \cdots, x_{n_1}$ and a symmetric function of $x_{n_1+1}, \cdots, x_{n_1+n_2}$. Also, since $v$
takes on only a finite number of values, its second moment exists. Then,
if $n_1, n_2 \to \infty$ such that $n_1/n_2$ has a limit, then by Theorem 5.6

$n_1^{1/2}(V - E(V))$

has a limiting normal distribution with mean 0 and finite variance. For large samples
the limiting distribution under the hypothesis can be used to give a test
of size $\alpha$. Also, for large samples the limiting distribution under an alternative
$(F(x), G(x))$ gives the power function of the test.
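
A two-sample statistic of the Mann-Whitney type can be sketched in Python as below. The orientation chosen here, counting pairs in which the second-sample value exceeds the first-sample value, is an assumption made for illustration (the opposite orientation gives one minus this value when there are no ties), and the names `c` and `mann_whitney_v` are hypothetical.

```python
def c(u):
    """Step function of (5.25): 0 for u < 0, 1/2 for u = 0, 1 for u > 0."""
    if u < 0:
        return 0.0
    if u == 0:
        return 0.5
    return 1.0


def mann_whitney_v(first, second):
    """Proportion of (x, y) pairs, x from first and y from second, with y > x.

    Ties contribute one half, following the definition of c(u).
    """
    total = sum(c(y - x) for x in first for y in second)
    return total / (len(first) * len(second))
```

Values of the statistic near one half are what would be expected when the two samples come from the same continuous distribution.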
In Section 2 of Chapter 5 we developed a method for constructing tests
having maximum power for a particular parameter value of the alternative.
The tests were conditional tests. For each example considered, the
test statistic was a linear function of the coordinates of the outcome, and
the hypothesis conditional distribution gave equal probability to each of a
finite number of permutations of the coordinates of the outcome. In this
section we consider a theorem by Wald and Wolfowitz and a number of
extensions which give the large-sample hypothesis distribution of such test
statistics.

Let $\mathcal{H}_n = (h_{1n}, \cdots, h_{nn})$ for $n = 1, 2, \cdots$ be sequences of real numbers.
In their theorem, Wald and Wolfowitz considered sequences that satisfy
a condition which we designate by W:

Condition W. For all $r = 3, 4, \cdots$,

(6.1)  $\frac{\frac{1}{n} \sum_{i=1}^{n} (h_{in} - \bar{h}_n)^r}{\left[\frac{1}{n} \sum_{i=1}^{n} (h_{in} - \bar{h}_n)^2\right]^{r/2}} = O(1),$

where $\bar{h}_n = n^{-1} \sum_{i=1}^{n} h_{in}$.

The condition says that, as $n$ increases, the $r$th central moment standardized
with respect to the variance should be bounded. Another condition
was introduced by Noether and we designate it by N.

Condition N. For all $r = 3, 4, \cdots$,

(6.2)  $\frac{\sum_{i=1}^{n} (h_{in} - \bar{h}_n)^r}{\left[\sum_{i=1}^{n} (h_{in} - \bar{h}_n)^2\right]^{r/2}} = o(1),$

where $\bar{h}_n = n^{-1} \sum_{i=1}^{n} h_{in}$.

Corresponding to a sequence $\mathcal{A}_n$, we define a random variable
$\mathbf{X}_n = (X_1, \cdots, X_n)$ which takes each permutation of $(a_{1n}, \cdots, a_{nn})$ with
the same probability $1/n!$. Then, corresponding to sequences $\mathcal{A}_n$ and
$\mathcal{C}_n$, we investigate the limiting distribution as $n \to \infty$ of the linear expression

(6.3)  $L_n = c_{1n}X_1 + \cdots + c_{nn}X_n.$

It is straightforward to prove

(6.4)  $E\{L_n\} = \frac{1}{n} \sum_{i=1}^{n} c_{in} \sum_{i=1}^{n} a_{in}$

and

(6.5)  $\mathrm{var}\{L_n\} = \frac{1}{n-1} \sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \sum_{j=1}^{n} (a_{jn} - \bar{a}_n)^2.$

(See Problem 18.)
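
The moment formulas (6.4) and (6.5) can be compared with a direct enumeration over all permutations. The following Python sketch is an illustration only; the function names are assumptions, and the enumeration is feasible only for short sequences.

```python
from itertools import permutations


def enumerated_moments(c, a):
    """Mean and variance of L = sum_i c_i X_i when (X_1, ..., X_n) is a
    uniformly random permutation of the sequence a."""
    values = [sum(ci * xi for ci, xi in zip(c, perm))
              for perm in permutations(a)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def formula_moments(c, a):
    """Closed forms: E{L} = (1/n) sum c sum a and
    var{L} = (1/(n-1)) sum (c - c_bar)^2 sum (a - a_bar)^2."""
    n = len(c)
    c_bar = sum(c) / n
    a_bar = sum(a) / n
    mean = sum(c) * sum(a) / n
    var = (sum((ci - c_bar) ** 2 for ci in c)
           * sum((aj - a_bar) ** 2 for aj in a) / (n - 1))
    return mean, var
```

Calling both functions on the same pair of short sequences should return matching means and variances up to rounding error.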
The original theorem concerning the distribution of $L_n$ was proved by
Wald and Wolfowitz and was a generalization of a limit theorem for the
rank correlation coefficient derived by Hotelling and Pabst. Our first
theorem is an extension of the Wald-Wolfowitz theorem and was proved
by Noether.

THEOREM 6.1. (WALD-WOLFOWITZ-NOETHER). If $\mathcal{C}_n$ satisfies condition
W (6.1), and $\mathcal{A}_n$ satisfies condition N (6.2), then

(6.6)  $L_n^0 = \frac{L_n - E\{L_n\}}{\sigma_{L_n}},$

where $L_n$ is defined by (6.3), has a limiting normal distribution with mean 0
and variance 1.

Proof. Let $C_{e_1 \cdots e_m}$ be a symmetric function generated by $c_{1n}, \cdots, c_{nn}$;
that is,

(6.7)  $C_{e_1 \cdots e_m} = \sum c_{i_1}^{e_1} \cdots c_{i_m}^{e_m},$

where the summation is over all permutations $(i_1, \cdots, i_m)$ of $m$ integers
chosen from $(1, \cdots, n)$. Similarly let $A_{e_1 \cdots e_m}$ designate the corresponding
symmetric function for the $\mathcal{A}_n$ sequence.

If we multiply each element of a sequence by a constant, or if we add a
constant to each element, we do not alter $L_n^0$. Hence it suffices to prove
the theorem when $\sum a_{in} = \sum c_{in} = 0$ and $\sum a_{in}^2 = \sum c_{in}^2 = n$. These
relations together with the conditions W and N establish that

(6.8)  $C_1 = 0, \quad C_2 = n, \quad C_r = O(n), \quad r = 3, 4, \cdots,$
       $A_1 = 0, \quad A_2 = n, \quad A_r = o(n^{r/2}), \quad r = 3, 4, \cdots.$

Then we have

$E\{L_n\} = C_1 E\{X_1\} = 0,$

$\mathrm{var}\{L_n\} = E\{L_n^2\}$
$= C_2 E\{X_1^2\} + C_{11} E\{X_1 X_2\}$
$= \frac{1}{n} C_2 A_2 + \frac{1}{n(n-1)} (C_1^2 - C_2)(A_1^2 - A_2)$
$= \frac{1}{n} C_2 A_2 + \frac{1}{n(n-1)} C_2 A_2$
$\sim n.$
By Theorem 2.6 it is equivalent to prove that $n^{-1/2}L_n$ has a limiting normal
distribution, and we shall do this by the method of moments, using
Theorems 2.1 and 2.2. We now prove that the $r$th moment of $n^{-1/2}L_n$
approaches the $r$th moment of the standardized normal distribution.

We have

(6.9)  $\mu_r = n^{-r/2} E\{L_n^r\}$
       $= n^{-r/2} \sum_{i_1=1}^{n} \cdots \sum_{i_r=1}^{n} c_{i_1} \cdots c_{i_r} E\{X_{i_1} \cdots X_{i_r}\}$
       $= n^{-r/2} [C_r E\{X_1^r\} + \cdots + c(r, e_1, \cdots, e_m) C_{e_1 \cdots e_m} E\{X_1^{e_1} \cdots X_m^{e_m}\} + \cdots + C_{1 \cdots 1} E\{X_1 \cdots X_r\}],$

where $e_1 + \cdots + e_m = r$, $e_k$ for $k = 1, \cdots, m$ is a positive integer, and
the coefficient $c(r, e_1, \cdots, e_m)$ is the number of ways that the $r$ indices
$i_1, \cdots, i_r$ can be tied into $m$ groups so that the $m$ groups in the order in
which their first element occurs in the sequence $i_1, \cdots, i_r$ are, respectively,
of sizes $e_1, \cdots, e_m$.
Since $E\{X_1^{e_1} \cdots X_m^{e_m}\} \sim n^{-m} A_{e_1 \cdots e_m}$, we have

(6.10)  $n^{-r/2} C_{e_1 \cdots e_m} E\{X_1^{e_1} \cdots X_m^{e_m}\} \sim n^{-m-r/2} A_{e_1 \cdots e_m} C_{e_1 \cdots e_m},$

and we designate by $B(r, e_1, \cdots, e_m)$ the right-hand side of this relation.
To complete the proof of the theorem we need a lemma which we shall
prove later.


LEMMA 6.1. $B(r, e_1, \cdots, e_m) \sim 0$ unless

(6.11)  $m = r/2, \quad e_1 = \cdots = e_m = 2,$

in which case $B(r, 2, \cdots, 2) \sim 1$.

By (6.9) $\mu_r$ is the sum of a finite number of expressions $c(r, e_1, \cdots, e_m) B(r, e_1, \cdots, e_m)$.
Therefore, if $r = 2s + 1$ ($s = 1, 2, \cdots$), $\mu_{2s+1} \sim 0$ since at least one of the
$e$'s in each $B$ must be odd. If $r = 2s$, $\mu_{2s} \sim c(2s, 2, \cdots, 2)$. Since the
first index of the expression being summed in (6.9) can be tied with any of
the $2s - 1$ others, the next free index with any of $2s - 3$ others, etc., it is
seen that $\mu_{2s} \sim (2s - 1)(2s - 3) \cdots 3 \cdot 1$. But these are the moments of the normal
distribution with mean 0 and variance 1, and hence the theorem follows.


Proof of Lemma. Let $A(j_1, \cdots, j_h) = A_{j_1} \cdots A_{j_h}$. Then by the theory
of symmetric functions $A_{e_1 \cdots e_m}$ can be expressed uniquely as a linear
combination of a finite number of $A(j_1, \cdots, j_h)$, where

(6.12)  $j_1 + \cdots + j_h = e_1 + \cdots + e_m = r$

and the $j$'s correspond to sums of $e$'s. Since $A_1 = 0$, we need only consider
$A(j_1, \cdots, j_h)$ having $j_g \ge 2$ ($g = 1, \cdots, h$). If some $j_g > 2$, then
(6.8) with (6.12) implies that

$A(j_1, \cdots, j_h) = o(n^{r/2}).$

If all $j_g = 2$, then

(6.13)  $A(2, \cdots, 2) = A_2^{r/2} = n^{r/2},$

$r$ is even, and, from the remark following (6.12), all the $e$'s must be 1's or
2's; therefore $m > r/2$ unless (6.11). From this we have

$A_{e_1 \cdots e_m} = o(n^{r/2}), \quad m \le r/2,$

and certainly

$A_{e_1 \cdots e_m} = O(n^{r/2}), \quad m > r/2,$

unless $m = r/2$ and $e_1 = \cdots = e_m = 2$, in which case

$A_{2 \cdots 2} \sim A(2, \cdots, 2) = n^{r/2}.$

Similarly, writing $C_{e_1 \cdots e_m}$ as a sum of products of the form $C_{j_1} \cdots C_{j_h}$, we
obtain the relations

$C_{e_1 \cdots e_m} = O(n^m), \quad m \le r/2,$
$C_{e_1 \cdots e_m} = O(n^{r/2}), \quad m > r/2.$

Combining these results, we have

$A_{e_1 \cdots e_m} C_{e_1 \cdots e_m} = o(n^{m+r/2}),$

unless $r$ is even and all the $e$'s are equal to 2, in which case

$A_{2 \cdots 2} C_{2 \cdots 2} \sim n^{r}.$

This proves the lemma.
The condition N introduced by Noether can be given in two simpler
forms, both of which are more convenient for application. We have

THEOREM 6.2. (HOEFFDING). The condition N for $\mathcal{H}_n$ is equivalent to
either of the following two conditions:

(6.14)  $\lim_{n \to \infty} \frac{\sum_{i=1}^{n} |h_{in} - \bar{h}_n|^r}{\left[\sum_{i=1}^{n} (h_{in} - \bar{h}_n)^2\right]^{r/2}} = 0$ for some $r > 2$;

(6.15)  $\lim_{n \to \infty} \frac{\max_{1 \le i \le n} (h_{in} - \bar{h}_n)^2}{\sum_{i=1}^{n} (h_{in} - \bar{h}_n)^2} = 0.$

Hence, if $\mathcal{C}_n$ satisfies condition W (6.1), and $\mathcal{A}_n$ satisfies any one of
conditions (6.2), (6.14), (6.15), then $L_n^0$ given by (6.6) has a limiting normal
distribution with mean 0 and variance 1.


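Proof. Let

$g_i = \frac{h_{in} - \bar{h}_n}{\left[\sum_{j=1}^{n} (h_{jn} - \bar{h}_n)^2\right]^{1/2}}$

and

$G_n = \max\{|g_1|, \cdots, |g_n|\}.$

We must prove the equivalence of the three conditions:

(6.16)  $\lim_{n \to \infty} \sum_{i=1}^{n} g_i^r = 0, \quad r = 3, 4, \cdots;$

(6.17)  $\lim_{n \to \infty} \sum_{i=1}^{n} |g_i|^r = 0$ for some $r > 2$;

(6.18)  $\lim_{n \to \infty} G_n = 0.$

Since we have

$\sum_{i=1}^{n} g_i^2 = 1,$

then, for $r > 2$,

$G_n^r \le \sum_{i=1}^{n} |g_i|^r \le G_n^{r-2} \sum_{i=1}^{n} g_i^2 = G_n^{r-2}.$

These inequalities imply the equivalence of the three conditions.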
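
Condition (6.15) is often the easiest form to examine numerically. The Python sketch below computes the ratio appearing in (6.15) for a given finite sequence; the function name `noether_ratio` is an assumption, and a single finite value can of course only suggest, not establish, the limiting behaviour.

```python
def noether_ratio(h):
    """Largest squared deviation from the mean divided by the sum of
    squared deviations, the ratio whose limit is required to be zero."""
    n = len(h)
    h_bar = sum(h) / n
    squares = [(x - h_bar) ** 2 for x in h]
    return max(squares) / sum(squares)
```

For the sequence of ranks $1, \cdots, n$ the ratio decreases roughly like $3/n$, so it tends to zero, whereas a sequence dominated by one extreme element keeps the ratio close to one.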


There are extensions of Theorem 6.1 which give the joint limiting
distribution of a number of $L_n$ statistics. We consider one such extension
giving the limiting joint distribution of two $L_n$ statistics.

THEOREM 6.3. If $\mathcal{A}_n$ satisfies condition N (6.2), if $\mathcal{C}_n$ and $\mathcal{D}_n$ satisfy
condition W, and if the correlation between $\mathcal{C}_n$ and $\mathcal{D}_n$,

(6.19)  $\rho_n = \frac{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)(d_{in} - \bar{d}_n)}{\left[\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \sum_{i=1}^{n} (d_{in} - \bar{d}_n)^2\right]^{1/2}},$

has a limit $\rho$, then the limiting distribution of

(6.20)  $L_n^0 = \frac{L_n - E(L_n)}{\sigma(L_n)}, \quad L_n'^0 = \frac{L_n' - E(L_n')}{\sigma(L_n')},$

where $L_n = \sum_{i=1}^{n} c_{in}X_i$, $L_n' = \sum_{i=1}^{n} d_{in}X_i$ [see (6.3)], is bivariate normal with
means 0, variances 1, and correlation $\rho$.


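Proof. By the same argument used in the proof of Theorem 6.1, it
suffices to consider the case when $\sum a_{in} = \sum c_{in} = \sum d_{in} = 0$ and
$\sum a_{in}^2 = \sum c_{in}^2 = \sum d_{in}^2 = n$. Then it is easily seen that $L_n^0$ is
asymptotically equivalent to $n^{-1/2}L_n$ and has a limiting normal distribution
with mean 0 and variance 1, and similarly for $L_n'^0$ and $n^{-1/2}L_n'$. As a first step we
shall prove that, if $\rho \ne \pm 1$, then, for any $\delta_1$, $\delta_2$, the linear combination
$n^{-1/2}(\delta_1 L_n + \delta_2 L_n')$, and hence $\delta_1 L_n^0 + \delta_2 L_n'^0$, has a limiting normal
distribution with mean 0 and variance $\delta_1^2 + \delta_2^2 + 2\delta_1\delta_2\rho$.

To apply Theorem 6.1 to the linear combination,

(6.21)  $\delta_1 L_n + \delta_2 L_n' = \sum_{i=1}^{n} (\delta_1 c_{in} + \delta_2 d_{in}) X_i,$

we need only show that $\delta_1\mathcal{C}_n + \delta_2\mathcal{D}_n$ satisfies the condition W. For this
we take $n$ large enough that $1 > \rho' > |\rho_n|$ and consider the second moment
of the elements $\delta_1 c_{in} + \delta_2 d_{in}$:

$\frac{1}{n} \sum_{i=1}^{n} (\delta_1 c_{in} + \delta_2 d_{in})^2 \ge \frac{1}{n} \left[ \delta_1^2 \sum_{i=1}^{n} c_{in}^2 + \delta_2^2 \sum_{i=1}^{n} d_{in}^2 - 2|\delta_1\delta_2| \left| \sum_{i=1}^{n} c_{in} d_{in} \right| \right]$
$= \delta_1^2 + \delta_2^2 - 2|\delta_1\delta_2| |\rho_n|$
$\ge \delta_1^2 + \delta_2^2 - 2|\delta_1\delta_2| \rho'.$

Hence the denominator of the condition W expression for $\delta_1\mathcal{C}_n + \delta_2\mathcal{D}_n$
is bounded from zero. We now show that the numerator is bounded.
The numerator is the $r$th moment of a sum, $\delta_1 c_{in} + \delta_2 d_{in}$. This $r$th
moment is bounded by

(6.22)  $\frac{2^r |\delta_1|^r}{n} \sum_{i=1}^{n} |c_{in}|^r + \frac{2^r |\delta_2|^r}{n} \sum_{i=1}^{n} |d_{in}|^r$

by virtue of the inequality $|x + y|^r \le |2x|^r + |2y|^r$; (6.22) is easily seen
to be bounded because from condition W all the moments of $\mathcal{C}_n$ and $\mathcal{D}_n$
are bounded. Thus the sequence $\delta_1\mathcal{C}_n + \delta_2\mathcal{D}_n$ satisfies condition W.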
Applying Theorem 6.1, we find that the limiting distribution of $\delta_1 L_n^0 + \delta_2 L_n'^0$
is normal with mean 0 and variance

(6.23)  $\delta_1^2 + \delta_2^2 + 2\delta_1\delta_2\rho.$

Let $(Y, Y')$ designate a random variable having a bivariate normal
distribution with means 0, variances 1, and correlation $\rho$. Then we can
say that the limiting distribution of $\delta_1 L_n^0 + \delta_2 L_n'^0$ is the distribution of
$\delta_1 Y + \delta_2 Y'$. If we knew that the joint distribution of $(L_n^0, L_n'^0)$ approached
a limiting distribution, say of $(Z, Z')$, we could apply Theorem 2.8 and
state that $\delta_1 L_n^0 + \delta_2 L_n'^0$ had the limiting distribution of $\delta_1 Z + \delta_2 Z'$.
Theorem 2.5 would then imply that $(Z, Z')$ had the same distribution as
$(Y, Y')$. However, if the distribution function of $(L_n^0, L_n'^0)$ does not
approach a limit, it is easily seen (cf. Cramér [15], p. 60) that two subsequences
can be extracted which converge to different limiting distributions.
This contradicts the result above that the limiting distribution
must be identical to that of $(Y, Y')$.

If $\rho = +1$, then it is easily seen that $L_n^0 + L_n'^0$ has a limiting normal
distribution with mean 0 and variance 4 and that $L_n^0 - L_n'^0$ approaches zero
in probability. This proves that $(L_n^0, L_n'^0)$ has the limiting bivariate
distribution (degenerate) as stated in the theorem. A similar result is
obtained if $\rho = -1$. This completes the proof.


The conditions of Theorem 6.1 have been modified by Hoeffding, who 
at the same time has considered the limiting distribution of a more general 


statistic. We shall quote his theorem, but first we introduce some 
necessary notation. 


Let $b_n(i, j)$ $(i, j = 1, \cdots, n)$ be $n^2$ real numbers defined for every positive
integer $n$, and let $(R_1, \cdots, R_n)$ designate the random variable which takes
each permutation of $(1, \cdots, n)$ with the same probability $1/n!$. Theorem
6.4 is concerned with the limiting distribution of the random variable

(6.24) $L_n = \sum_{i=1}^{n} b_n(i, R_i).$

We define

$d_n(i, j) = b_n(i, j) - \frac{1}{n}\sum_{g=1}^{n} b_n(g, j) - \frac{1}{n}\sum_{h=1}^{n} b_n(i, h) + \frac{1}{n^2}\sum_{g=1}^{n}\sum_{h=1}^{n} b_n(g, h).$

Then it is straightforward to prove that

(6.25) $E\{L_n\} = \frac{1}{n}\sum_{i,j=1}^{n} b_n(i, j)$

and

(6.26) $\operatorname{var}\{L_n\} = \frac{1}{n-1}\sum_{i,j=1}^{n} d_n^2(i, j).$


Problem 19 is to prove these relations. 
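As a numerical check on (6.25) and (6.26), read as $E\{L_n\} = n^{-1}\sum b_n(i,j)$ and $\operatorname{var}\{L_n\} = (n-1)^{-1}\sum d_n^2(i,j)$, the following Python sketch enumerates all $n!$ permutations for a small integer matrix and compares the exact moments with the formulas. It is an illustration only and not part of the original text; the function names `exact_moments` and `formula_moments` and the test matrix are our own assumptions.

```python
from fractions import Fraction
from itertools import permutations


def exact_moments(b):
    """Exact mean and variance of L = sum_i b[i][R_i], R uniform over permutations.

    `b` is assumed to be a square list of lists of integers.
    """
    n = len(b)
    values = [sum(b[i][p[i]] for i in range(n)) for p in permutations(range(n))]
    mean = Fraction(sum(values), len(values))
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def formula_moments(b):
    """Mean and variance of L from the closed forms (6.25) and (6.26)."""
    n = len(b)
    total = sum(sum(row) for row in b)
    row_mean = [Fraction(sum(b[i]), n) for i in range(n)]
    col_mean = [Fraction(sum(b[g][j] for g in range(n)), n) for j in range(n)]
    grand = Fraction(total, n * n)
    # d(i, j) = b(i, j) - column mean - row mean + grand mean
    d2 = sum(
        (b[i][j] - col_mean[j] - row_mean[i] + grand) ** 2
        for i in range(n)
        for j in range(n)
    )
    return Fraction(total, n), d2 / (n - 1)


if __name__ == "__main__":
    b = [[1, 4, 2, 0], [3, 3, 5, 1], [0, 2, 2, 7], [6, 1, 0, 2]]
    print(exact_moments(b), formula_moments(b))
```

Exact rational arithmetic is used so that the comparison is an equality rather than a floating-point approximation; the enumeration is only practical for small $n$.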
THEOREM 6.4. (HOEFFDING). If

(6.27) $\lim_{n\to\infty} n^{r/2-1} \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} d_n^r(i, j)}{\left[\sum_{i=1}^{n}\sum_{j=1}^{n} d_n^2(i, j)\right]^{r/2}} = 0, \qquad r = 3, 4, 5, \cdots,$

then $L_n$ (6.24) is asymptotically normally distributed with mean and
variance given by (6.25) and (6.26). Condition (6.27) is satisfied if

(6.28) $\lim_{n\to\infty} n \frac{\max_{i,j} d_n^2(i, j)}{\sum_{i=1}^{n}\sum_{j=1}^{n} d_n^2(i, j)} = 0.$


Proof. See Hoeffding [7]. 


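In the particular case having $L_n = \sum_{i=1}^{n} c_{in} a_{R_i n}$, we obtain more general
conditions under which Theorem 6.1 remains valid; we have

THEOREM 6.5. If

(6.29) $\lim_{n\to\infty} n^{r/2-1} \frac{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^r \sum_{i=1}^{n} (a_{in} - \bar{a}_n)^r}{\left[\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \sum_{i=1}^{n} (a_{in} - \bar{a}_n)^2\right]^{r/2}} = 0, \qquad r = 3, 4, 5, \cdots,$

then $L_n = \sum_{i=1}^{n} c_{in} a_{R_i n}$ is asymptotically normally distributed with mean
and variance given by (6.4) and (6.5). Condition (6.29) is satisfied if

(6.30) $\lim_{n\to\infty} n \frac{\max_i (c_{in} - \bar{c}_n)^2 \max_i (a_{in} - \bar{a}_n)^2}{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \sum_{i=1}^{n} (a_{in} - \bar{a}_n)^2} = 0.$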


Proof. This is an immediate corollary of Theorem 6.4. 


Dwass [11] has also obtained an extension of Theorem 6.1. Let 
Xi, ++, X„ be independent and each have the continuous distribution 
function F). Designating by @a» ** +, %,)) the order statistic over R”, 
We define a sequence (bin ` * Dag): 


: 1 n r 
bin = E(XG) — ea EX} 
THEOREM 6.6. (Dwass). If k>1, if fie dF(x) <<, and if 
either X/ has a normal distribution or 


(6.31) lim, =N 


then L= > CinP rin is asymptotically normally distributed with mean 
I 


and variance given by (6.4) and (6.5) (with a’s replaced by b’s). 


Proof. See Dwass [11]. The method of proof is to show that asympto- 
tically the random variable L,, is equivalent to the random variable 


n 


(6.32) LE = > inf, 


i=l 


which can be shown by the central-limit theorem to have a limiting normal 
distribution. 


In many applications of the theorems a sequence will be the observed
value of a sequence of random variables. It is then of interest to inquire
whether the conditions of a theorem are fulfilled with probability one for
large samples. We have


THEOREM 6.7. If $X_1, \cdots, X_n$ are independent and identically distributed
and if $\operatorname{var}\{X_1\} > 0$ and $E\{|X_1|^3\} < \infty$, then with probability one the
sequence $(X_1, \cdots, X_n)$ satisfies Noether's condition N (6.2), (6.14) or
(6.15).


Proof. By Kolmogorov's Theorem 3.2 (strong law of large numbers),
it follows that with probability one each of the following limits holds:

$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} X_i = E\{X\},$

$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} X_i^2 = E\{X^2\},$

$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} |X_i|^3 = E\{|X|^3\},$

where $X$ designates a random variable with the distribution of the $X_i$.
From this it follows simply that with probability one

$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 = E\{[X - E(X)]^2\},$

$\lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} |X_i - \bar{X}|^3 = E\{|X - E(X)|^3\}.$

Then, since $\operatorname{var}\{X_1\} > 0$, we have with probability one that

$\lim_{n\to\infty} \frac{n^{-1}\sum_{i=1}^{n} |X_i - \bar{X}|^3}{\left[n^{-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^{3/2}} = \frac{E\{|X - E(X)|^3\}}{\left(E\{[X - E(X)]^2\}\right)^{3/2}}.$

But this implies that with probability one condition (6.14) is fulfilled.
This with Theorem 6.2 completes the proof.
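The argument of the proof can be watched numerically. The sketch below is our own illustration; it assumes that the quantity of interest is the ratio $\sum |x_i - \bar{x}|^3 / [\sum (x_i - \bar{x})^2]^{3/2}$, which is $n^{-1/2}$ times the ratio of sample moments appearing in the proof and so should shrink at about that rate when the third absolute moment is finite. The function name `third_moment_ratio`, the seed, and the choice of a normal sample are hypothetical.

```python
import random


def third_moment_ratio(xs):
    """Return sum|x - mean|^3 / (sum (x - mean)^2)^(3/2) for a sample xs."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs)
    s3 = sum(abs(x - mean) ** 3 for x in xs)
    return s3 / s2 ** 1.5


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    for n in (100, 1000, 20000):
        # The ratio should fall roughly like n ** -0.5 as n grows.
        print(n, third_moment_ratio(sample[:n]))
```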

If $\mathfrak{c}_n = (c_{1n}, \cdots, c_{nn})$ is a sequence satisfying condition W (6.1), and if
$X_1, \cdots, X_n$ are independent random variables satisfying the conditions of
Theorem 6.7, then with probability one the sequence $(X_1, \cdots, X_n)$
produces an outcome $(x_1, \cdots, x_n)$ for which the limiting distribution of

(6.33) $L_n = \sum_{i=1}^{n} c_{in} x_{R_i}$

is normal.

As a related result we have a theorem proved by Hoeffding in [8].
Let $X_1, \cdots, X_n$ be identically distributed according to a distribution having
$E\{|X|^3\} < \infty$ and $\operatorname{var}\{X\} > 0$.

THEOREM 6.8. (HOEFFDING). The condition $X$ is normally distributed
or the condition

(6.34) $\lim_{n\to\infty} \frac{\max_i (c_{in} - \bar{c}_n)^2}{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2} = 0$

is a necessary and sufficient condition that, as $n \to \infty$, the probability
approaches one that the random variable $(X_1, \cdots, X_n)$ produces outcomes
$(x_1, \cdots, x_n)$ for which the limiting distribution function of

(6.35) $s(x_R) = \frac{\sum_{i=1}^{n} (c_{in} - \bar{c}_n) x_{R_i}}{\left[\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \, (n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{1/2}}$

is within any preassigned amount of the normal distribution function with
mean 0 and variance 1.
Proof. See Hoeffding [8].


Note. If $F(s; x)$ is the distribution function of $s(x_R)$, the theorem gives
a necessary and sufficient condition that $F(s; X) \to \Phi(s)$ stochastically as
$n \to \infty$, where $\Phi(s)$ is the standardized normal distribution function.


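We complete this section by quoting a related theorem also proved by
Hoeffding [8]. Let $X_1, \cdots, X_n$ be independent and identically distributed
according to a distribution having $E\{|X|^3\} < \infty$ and $\operatorname{var}\{X\} > 0$. Also,
let the random variables $Z_1, \cdots, Z_n$ be defined by

(6.36) $Z_i = X_i + d_{in},$

where $\mathfrak{d}_n = (d_{1n}, \cdots, d_{nn})$ is a sequence of real numbers.


THEOREM 6.9. (HOEFFDING). In order that, as $n \to \infty$, the probability
approach one that the random variable $(Z_1, \cdots, Z_n)$ defined above
produce outcomes $(z_1, \cdots, z_n)$ for which the limiting distribution function
of

(6.37) $s(z_R) = \frac{\sum_{i=1}^{n} (c_{in} - \bar{c}_n) z_{R_i}}{\left[\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \, (n-1)^{-1} \sum_{i=1}^{n} (z_i - \bar{z})^2\right]^{1/2}}$

is within any preassigned amount of the normal distribution function with
mean 0 and variance 1, a sufficient condition is that

either $X$ is normally distributed or

(6.38) $\lim_{n\to\infty} \frac{\max_i (c_{in} - \bar{c}_n)^2}{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2} = 0$

and

(6.39) $\lim_{n\to\infty} n^{r/2-1} \frac{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^r \sum_{i=1}^{n} (d_{in} - \bar{d}_n)^r}{\left[\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \sum_{i=1}^{n} (d_{in} - \bar{d}_n)^2\right]^{r/2}} = 0, \qquad r = 3, 4, 5, \cdots,$

the latter condition being satisfied if

(6.40) $\lim_{n\to\infty} n \frac{\max_i (c_{in} - \bar{c}_n)^2 \max_i (d_{in} - \bar{d}_n)^2}{\sum_{i=1}^{n} (c_{in} - \bar{c}_n)^2 \sum_{i=1}^{n} (d_{in} - \bar{d}_n)^2} = 0.$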
Condition (6.39) can be replaced by

(6.41) $\lim_{n\to\infty} n^{-1} \sum_{i=1}^{n} (d_{in} - \bar{d}_n)^2 = 0.$

Condition (6.38) can be replaced by

(6.42) $\lim_{n\to\infty} n^{-1} \sum_{i=1}^{n} (d_{in} - \bar{d}_n)^2 = \infty.$


Proof. We merely sketch the main idea of the proof; for the details,
see Hoeffding [8]. By using (6.36) the numerator of (6.37) can be broken
into $\sum_{i=1}^{n} (c_{in} - \bar{c}_n) x_{R_i}$ and $\sum_{i=1}^{n} (c_{in} - \bar{c}_n) d_{R_i n}$. Then condition (6.38) with
Theorem 6.8 gives the first expression a limiting normal distribution, and
condition (6.39) with Theorem 6.5 gives the second expression a limiting
normal distribution.


EXAMPLE 6.1. The rank correlation coefficient is defined in Problem 11.
In Example 5.3 we considered its use for making a test of independence.
Let $(X_1^{(1)}, X_1^{(2)}), \cdots, (X_n^{(1)}, X_n^{(2)})$ be independent and each have the same
continuous distribution function over $R^2$. The problem of independence
is to test the hypothesis that the coordinates of the bivariate distribution
are independent. Letting $r_1^{(1)}, \cdots, r_n^{(1)}$ designate the ranks of $x_1^{(1)}, \cdots, x_n^{(1)}$,
and similarly $r_1^{(2)}, \cdots, r_n^{(2)}$ designate the ranks of $x_1^{(2)}, \cdots, x_n^{(2)}$, we can
write the rank correlation statistic

(6.43) $k' = \frac{12 \sum_{i=1}^{n} r_i^{(1)} r_i^{(2)} - 3n(n+1)^2}{n^3 - n}.$

An equivalent statistic is $\sum_{i=1}^{n} r_i^{(1)} r_i^{(2)}$. To construct an independence test
based on $k'$ or $\sum r_i^{(1)} r_i^{(2)}$, we need to know the distribution of the statistic
under the hypothesis. This has been tabulated for small $n$, but, as $n$
increases, the numerical work to obtain the distribution becomes excessive.
However, the Wald-Wolfowitz theorem gives the limiting distribution, and
it has been observed that the approximation is good even for $n$ of the order
of 10. For the statistic $\sum r_i^{(1)} r_i^{(2)}$ we can describe the hypothesis
distribution by the random variable

$\sum_{i=1}^{n} i R_i,$

where $(R_1, \cdots, R_n)$ is a random permutation of $(1, \cdots, n)$. We apply
Theorem 6.1:


Defeat ieee go 
2 4 


1 jr r 
PA = O(n’). 


From these two relations it is easily seen that the conditions of Theorem
6.1 or 6.5 are fulfilled. Hence $\sum_{i=1}^{n} i R_i$ has a limiting normal distribution
with mean given by (6.4),

(6.44) $E\left\{\sum_{i=1}^{n} i R_i\right\} = \frac{n(n+1)^2}{4} \sim \frac{n^3}{4},$

and variance given by (6.5),

(6.45) $\operatorname{var}\left\{\sum_{i=1}^{n} i R_i\right\} = \frac{n^2(n^2-1)^2}{144(n-1)} \sim \frac{n^5}{144}.$
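The exact forms in (6.44) and (6.45), read as $n(n+1)^2/4$ and $n^2(n^2-1)^2/[144(n-1)] = n^2(n+1)^2(n-1)/144$, can be confirmed for small $n$ by listing every permutation. The Python sketch below is ours and is offered only as a check; `exact_mean_var` and `formula_mean_var` are hypothetical names.

```python
from fractions import Fraction
from itertools import permutations


def exact_mean_var(n):
    """Exact mean and variance of sum_i i * R_i over all permutations of 1..n."""
    values = [
        sum((i + 1) * (r + 1) for i, r in enumerate(p))
        for p in permutations(range(n))
    ]
    mean = Fraction(sum(values), len(values))
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def formula_mean_var(n):
    """Closed forms: n(n+1)^2/4 and n^2 (n+1)^2 (n-1)/144."""
    mean = Fraction(n * (n + 1) ** 2, 4)
    var = Fraction(n ** 2 * (n + 1) ** 2 * (n - 1), 144)
    return mean, var


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, exact_mean_var(n), formula_mean_var(n))
```

For $n = 5$ both routes give mean 45 and variance 25.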


In Chapter 5, Section 3, we developed a method for finding rank tests 
having maximum power at a particular distribution in the alternative or 
having locally maximum power for a parametric class of alternatives. A 
test (Problem 19, Chapter 5) of independence having locally maximum 
power against normal alternatives involving dependence is based on the 
statistic 


(6.46) der byo, 


i=1 


where 


n -1/2 
g= [> ezon] š 


bi = EfZin), 


(6.47) 
and $E(Z_{(1)}), \cdots, E(Z_{(n)})$ are the expected values of the order statistics for a
sample of $n$ from the normal distribution with mean 0 and variance 1.
The statistic's limiting distribution under the hypothesis of independence
can be proved normal by the use of Theorem 6.6 (see Problem 20).

The similar test most powerful against normal alternatives was developed
in Chapter 5, Example 2.2. It is a conditional test, given the order
statistics $(x_{(1)}^{(1)}, \cdots, x_{(n)}^{(1)})$, $(x_{(1)}^{(2)}, \cdots, x_{(n)}^{(2)})$, and is based on the statistic

(6.48) $\sum_{i=1}^{n} x_i^{(1)} x_i^{(2)}.$

Under the independence hypothesis we consider the limiting form of the
distribution of this statistic as the sample size $n$ increases. Assume that
$\operatorname{var}\{X^{(1)}\} > 0$, $\operatorname{var}\{X^{(2)}\} > 0$ and that the third absolute moments are
finite. By Theorem 6.7 it is seen that with probability one the set
$(x_1^{(2)}, \cdots, x_n^{(2)})$ will satisfy condition N. Then by Theorem 6.8 it is seen
that, as $n \to \infty$, the probability approaches one that the conditional
distribution of $\sum_{i=1}^{n} X_i^{(1)} X_i^{(2)}$, given $(x_{(1)}^{(1)}, \cdots, x_{(n)}^{(1)})$ and $(x_{(1)}^{(2)}, \cdots, x_{(n)}^{(2)})$, is
within any preassigned amount of the normal distribution function having
the same mean and variance.

EXAMPLE 6.2. In Chapter 5, Example 2.1, we showed that Pitman's
two-sample test was the similar test most powerful for normal alternatives.
Let $(x_1, \cdots, x_{n_1})$ and $(x_{n_1+1}, \cdots, x_{n_1+n_2})$ be outcomes for the 'first' and
'second' samples. The Pitman test is a conditional test, given the order
statistic $(x_{(1)}, \cdots, x_{(n_1+n_2)})$ for the combined sample, and can be based on
the statistic

(6.49) $\frac{1}{n_2}\sum_{j=1}^{n_2} x_{n_1+j} - \frac{1}{n_1}\sum_{i=1}^{n_1} x_i.$

To perform the test we need to know the hypothesis distribution of this
statistic, given the order statistic. The limiting form of this distribution
can be found by using Theorem 6.8 or Theorem 6.1 with Theorem 6.7.
For we can describe the hypothesis distribution by the random variable

$\sum_{j=1}^{n_2} \frac{1}{n_2}\, x_{R_{n_1+j}} + \sum_{i=1}^{n_1} \left(-\frac{1}{n_1}\right) x_{R_i}.$

The coefficients, $1/n_2$, $-1/n_1$, obviously satisfy (6.34) as the sample sizes
approach infinity in a given ratio. Then, if $E\{|X|^3\} < \infty$ and $\operatorname{var}\{X\} > 0$,
we have that the conditional distribution of (6.49) under the hypothesis
approaches in probability the normal distribution as the sample sizes
approach infinity in a given ratio.
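A minimal Python sketch of how this limiting result might be used in practice follows. It is not from the original text. It assumes the statistic is the difference of sample means (second minus first), that its permutation mean is 0, and that its permutation variance is the standard value $(n-1)^{-1}\sum (c_i - \bar{c})^2 \sum (x_i - \bar{x})^2$ for a linear permutation statistic, which we take to be what (6.5) gives. The names `pitman_normal_approx` and `exact_permutation_variance` are hypothetical, and the second function is included only to check the variance by enumeration on a small sample.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def pitman_normal_approx(first, second):
    """Normal approximation to the permutation distribution of the mean difference.

    Returns (statistic, permutation variance, z score, upper-tail probability).
    Assumes the pooled observations are not all equal.
    """
    n1, n2 = len(first), len(second)
    pooled = list(first) + list(second)
    n = n1 + n2
    stat = sum(second) / n2 - sum(first) / n1
    xbar = sum(pooled) / n
    sxx = sum((x - xbar) ** 2 for x in pooled)
    # Coefficients are -1/n1 (n1 times) and 1/n2 (n2 times); their mean is 0,
    # so the sum of squared deviations of the coefficients is 1/n1 + 1/n2.
    scc = 1 / n1 + 1 / n2
    var = scc * sxx / (n - 1)
    z = stat / sqrt(var)
    return stat, var, z, 1 - NormalDist().cdf(z)


def exact_permutation_variance(first, second):
    """Variance of the mean difference over all splits of the pooled sample."""
    pooled = list(first) + list(second)
    n1, n = len(first), len(pooled)
    n2 = n - n1
    total = sum(pooled)
    stats = []
    for idx in combinations(range(n), n1):
        s1 = sum(pooled[i] for i in idx)
        stats.append((total - s1) / n2 - s1 / n1)
    m = sum(stats) / len(stats)
    return sum((s - m) ** 2 for s in stats) / len(stats)


if __name__ == "__main__":
    a, b = [1.0, 2.5, 3.0], [2.0, 4.5, 5.0, 6.5]
    print(pitman_normal_approx(a, b))
    print(exact_permutation_variance(a, b))
```

With samples this small the normal tail probability is only a rough guide; the example is meant to show the bookkeeping, and the exact enumeration remains feasible there.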
The rank test locally most powerful against normal alternatives is based
on the statistic

$\sum_{j=1}^{n_2} E\{Z_{(r_{n_1+j})}\}\left(\frac{1}{n_2}\right) + \sum_{i=1}^{n_1} E\{Z_{(r_i)}\}\left(-\frac{1}{n_1}\right),$

where the $Z$'s were defined in the previous example with $n = n_1 + n_2$ and
$r_1, \cdots, r_{n_1+n_2}$ are the ranks of $x_1, \cdots, x_{n_1+n_2}$ in the combined sample.
By Theorem 6.6 the statistic has a limiting normal distribution under the
hypothesis.


7. THE LIMITING DISTRIBUTION OF RUNS AND ADDITIVE 
PARTITION FUNCTIONS 


In this section we quote some limiting distributions which can be used in
the construction of two-sample tests. Let $V = (V_1, \cdots, V_{n_1+n_2})$ be a
random variable which takes each different permutation of $n_1$ 1's and $n_2$ 2's
with the same probability,

$\binom{n_1+n_2}{n_1}^{-1}.$

Let $v = (v_1, \cdots, v_{n_1+n_2})$ designate a typical permutation. Then we
define a run of 1's to be a set of consecutive 1's in $(v_1, \cdots, v_{n_1+n_2})$ preceded
by 2's or the beginning and succeeded by 2's or the end. A run of 2's is
defined analogously. We now consider some statistics which are functions
of $v$; let


rı; = number of runs of 1’s of length j, 
"2; = number of runs of 2’s of length j, 
ri; = number of runs of 1’s of length j or more, 
ra; = number of runs of 2’s of length j or more, 
rı = number of runs of 1’s, 
r, = number of runs of 2’s, 
r = number of runs. 
There are relations among these statistics, for example: 
Ay=tut naa tee 
Ry = Ape fark, 
r = r + ro 
Corresponding to the random variable $V = (V_1, \cdots, V_{n_1+n_2})$, we now
find the induced probability distribution of the $r_{1j}$'s and $r_{2j}$'s. First we note
that the number of runs of 1's can differ from the number of runs of 2's
by at most 1, since the two types of runs alternate in the sequence.
Also, a given sequence of 1 runs and a given sequence of 2 runs can be put
together to form a $V$ sequence in only one way if $|r_1 - r_2| = 1$ and in two
ways if $r_1 = r_2$. For this it is convenient to define a function,

(7.1) $F(r_1, r_2) = \begin{cases} 0 & \text{if } |r_1 - r_2| > 1, \\ 1 & \text{if } |r_1 - r_2| = 1, \\ 2 & \text{if } r_1 = r_2. \end{cases}$

Consider the runs of 1's. There are $r_{11}$ of length 1, $r_{12}$ of length 2, and so
on. The number of ways of arranging these to form different sequences is

$\frac{r_1!}{r_{11}!\, r_{12}! \cdots}.$

Similarly, for the 2's the number of ways of arranging $r_{21}$ runs of length 1,
$r_{22}$ of length 2, and so on, is

$\frac{r_2!}{r_{21}!\, r_{22}! \cdots}.$

Also, a sequence of 1 runs can be put with a sequence of 2 runs to form a $V$
sequence in $F(r_1, r_2)$ ways. Hence we have

(7.2) $P(r_{11}, r_{12}, \cdots, r_{21}, r_{22}, \cdots) = \frac{r_1!}{r_{11}!\, r_{12}! \cdots}\, \frac{r_2!}{r_{21}!\, r_{22}! \cdots}\, \frac{F(r_1, r_2)}{\binom{n_1+n_2}{n_1}}.$


ny 


The number of ways of forming $r_2$ runs from a sequence of $n_2$ 2's is
the number of ways that $r_2 - 1$ dividing places may be chosen from the
$n_2 - 1$ spaces between 2's in the sequence. Hence we have the marginal
distribution for $r_{11}, r_{12}, \cdots, r_2$:

(7.3) $P(r_{11}, r_{12}, \cdots, r_2) = \frac{r_1!}{r_{11}!\, r_{12}! \cdots} \binom{n_2-1}{r_2-1} \frac{F(r_1, r_2)}{\binom{n_1+n_2}{n_1}}.$

Then, summing this over $r_2$, we obtain

(7.4) $P(r_{11}, r_{12}, \cdots) = \frac{r_1!}{r_{11}!\, r_{12}! \cdots} \left[\binom{n_2-1}{r_1-2} + 2\binom{n_2-1}{r_1-1} + \binom{n_2-1}{r_1}\right] \frac{1}{\binom{n_1+n_2}{n_1}} = \frac{r_1!}{r_{11}!\, r_{12}! \cdots}\, \frac{\binom{n_2+1}{r_1}}{\binom{n_1+n_2}{n_1}}.$
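By the argument that produced (7.3) we find the distribution for $r_1, r_2$ to
be

(7.5) $P(r_1, r_2) = \binom{n_1-1}{r_1-1}\binom{n_2-1}{r_2-1} \frac{F(r_1, r_2)}{\binom{n_1+n_2}{n_1}},$

and the distribution for $r$ to be

(7.6) $P(r) = 2\binom{n_1-1}{\frac{r}{2}-1}\binom{n_2-1}{\frac{r}{2}-1} \frac{1}{\binom{n_1+n_2}{n_1}}$ if $r$ is even,

$P(r) = \left[\binom{n_1-1}{\frac{r-1}{2}}\binom{n_2-1}{\frac{r-3}{2}} + \binom{n_1-1}{\frac{r-3}{2}}\binom{n_2-1}{\frac{r-1}{2}}\right] \frac{1}{\binom{n_1+n_2}{n_1}}$ if $r$ is odd.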
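The distributions of $(r_1, r_2)$ and of $r$ can be checked against a direct count for small $n_1, n_2$. The Python sketch below is our own; it assumes the usual closed forms, $P(r_1, r_2) = \binom{n_1-1}{r_1-1}\binom{n_2-1}{r_2-1}F(r_1, r_2)/\binom{n_1+n_2}{n_1}$ and the corresponding even and odd expressions for $P(r)$, and requires $n_1, n_2 \ge 1$. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, groupby
from math import comb


def run_counts(seq):
    """Return (number of runs of 1's, number of runs of 2's) in seq."""
    r1 = r2 = 0
    for symbol, _ in groupby(seq):
        if symbol == 1:
            r1 += 1
        else:
            r2 += 1
    return r1, r2


def exact_joint(n1, n2):
    """Exact distribution of (r1, r2) by listing every arrangement."""
    n = n1 + n2
    counts = Counter()
    for ones in combinations(range(n), n1):
        chosen = set(ones)
        seq = [1 if k in chosen else 2 for k in range(n)]
        counts[run_counts(seq)] += 1
    total = comb(n, n1)
    return {key: Fraction(c, total) for key, c in counts.items()}


def F(r1, r2):
    """The function (7.1): ways to interleave the two sequences of runs."""
    gap = abs(r1 - r2)
    return 2 if gap == 0 else (1 if gap == 1 else 0)


def formula_joint(n1, n2, r1, r2):
    """Closed form assumed for P(r1, r2)."""
    num = comb(n1 - 1, r1 - 1) * comb(n2 - 1, r2 - 1) * F(r1, r2)
    return Fraction(num, comb(n1 + n2, n1))


def formula_total(n1, n2, r):
    """Closed form assumed for P(r), r >= 2, with even and odd cases."""
    total = comb(n1 + n2, n1)
    if r % 2 == 0:
        h = r // 2
        return Fraction(2 * comb(n1 - 1, h - 1) * comb(n2 - 1, h - 1), total)
    a, b = (r - 1) // 2, (r - 3) // 2
    num = comb(n1 - 1, a) * comb(n2 - 1, b) + comb(n1 - 1, b) * comb(n2 - 1, a)
    return Fraction(num, total)


if __name__ == "__main__":
    for (r1, r2), p in sorted(exact_joint(4, 3).items()):
        print(r1, r2, p, formula_joint(4, 3, r1, r2))
```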


We now consider the problem of finding the means, variances, and
covariances for the $r_{11}, r_{12}, \cdots$ and $r_{21}, r_{22}, \cdots$. As is the case with most
discrete distributions involving integers, the factorial moments are the
natural moments to calculate. If we define

$x^{(a)} = x(x - 1) \cdots (x - a + 1),$

then the $a$th factorial moment of a real random variable $X$ is defined to be

$E\{X^{(a)}\}.$

From the first and second factorial moments it is straightforward to
calculate the ordinary mean and variance:

$E\{X\} = E\{X^{(1)}\},$
$\operatorname{var}\{X\} = E\{X^{(2)}\} + E\{X^{(1)}\} - [E\{X^{(1)}\}]^2.$
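The passage from factorial moments to the mean and variance can be written out in a few lines. The Python sketch below is ours; `falling`, `factorial_moment`, and `mean_and_variance` are hypothetical names, and the probability function is assumed to be supplied as a dictionary from integer values to exact probabilities.

```python
from fractions import Fraction


def falling(x, a):
    """The descending factorial x(x-1)...(x-a+1)."""
    out = 1
    for k in range(a):
        out *= x - k
    return out


def factorial_moment(pmf, a):
    """The a-th factorial moment for a dict mapping value -> probability."""
    return sum(p * falling(x, a) for x, p in pmf.items())


def mean_and_variance(pmf):
    """Mean and variance from the first two factorial moments."""
    m1 = factorial_moment(pmf, 1)
    m2 = factorial_moment(pmf, 2)
    return m1, m2 + m1 - m1 ** 2


if __name__ == "__main__":
    # Number of runs of 1's when n1 = 2, n2 = 1: the arrangements 112, 121, 211
    # give 1, 2, 1 runs, so the value 1 has probability 2/3 and 2 has 1/3.
    pmf = {1: Fraction(2, 3), 2: Fraction(1, 3)}
    print(mean_and_variance(pmf))
```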
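We illustrate the method of calculating factorial moments by evaluating
$E\{r_{1i}\}$, and then we quote the formulas for the other means and for the
variances and covariances. We have

$E\{r_{1i}\} = \sum{}' \, r_{1i}\, P(r_{11}, r_{12}, \cdots),$

where the summation $\sum'$ is over all non-negative integers $r_{11}, r_{12}, \cdots$ such that
$\sum_j j\, r_{1j} = n_1$. Then

(7.7) $E\{r_{1i}\} = \sum{}' \, r_{1i}\, \frac{r_1!}{r_{11}!\, r_{12}! \cdots}\, \frac{\binom{n_2+1}{r_1}}{\binom{n_1+n_2}{n_1}} = \sum{}'' \, \frac{r_1!}{r_{11}!\, r_{12}! \cdots (r_{1i} - 1)! \cdots}\, \frac{\binom{n_2+1}{r_1}}{\binom{n_1+n_2}{n_1}},$

where the summation $\sum''$ is the same as $\sum'$ only excluding cases for which
$r_{1i} = 0$. Since $r_1 \binom{n_2+1}{r_1} = (n_2+1)\binom{n_2}{r_1-1}$, this gives

$E\{r_{1i}\} = (n_2 + 1) \sum{}'' \, \frac{(r_1 - 1)!}{r_{11}!\, r_{12}! \cdots (r_{1i} - 1)! \cdots}\, \frac{\binom{n_2}{r_1-1}}{\binom{n_1+n_2}{n_1}}$

$= \frac{(n_2+1)^{(2)}\, n_1^{(i)}}{(n_1+n_2)^{(i+1)}} \sum{}'' \, \frac{(r_1 - 1)!}{r_{11}!\, r_{12}! \cdots (r_{1i} - 1)! \cdots}\, \frac{\binom{n_2}{r_1-1}}{\binom{n_1+n_2-i-1}{n_1-i}}.$

This last is obtained from the relation

$\binom{n_1+n_2}{n_1} = \frac{(n_1+n_2)^{(i+1)}}{n_1^{(i)}\, n_2} \binom{n_1+n_2-i-1}{n_1-i}.$

But the summation in the expression above produces a total 1 since the
terms being added are the probabilities (7.4) with $n_1$ replaced by $n_1 - i$ and
$n_2$ replaced by $n_2 - 1$. Therefore

(7.8) $E\{r_{1i}\} = \frac{(n_2+1)^{(2)}\, n_1^{(i)}}{(n_1+n_2)^{(i+1)}},$

and by symmetry

(7.9) $E\{r_{2i}\} = \frac{(n_1+1)^{(2)}\, n_2^{(i)}}{(n_1+n_2)^{(i+1)}}.$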
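The mean (7.8), read as $E\{r_{1i}\} = (n_2+1)^{(2)} n_1^{(i)} / (n_1+n_2)^{(i+1)}$ with $x^{(a)}$ the descending factorial, can be compared with a direct count over all arrangements. The Python sketch below is ours and serves only as a check for small $n_1, n_2 \ge 1$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, groupby
from math import comb


def falling(x, a):
    """The descending factorial x(x-1)...(x-a+1)."""
    out = 1
    for k in range(a):
        out *= x - k
    return out


def exact_expected_runs(n1, n2, i):
    """Exact expected number of runs of 1's having length exactly i."""
    n = n1 + n2
    total = 0
    for ones in combinations(range(n), n1):
        chosen = set(ones)
        seq = [1 if k in chosen else 2 for k in range(n)]
        total += sum(
            1 for symbol, run in groupby(seq)
            if symbol == 1 and len(list(run)) == i
        )
    return Fraction(total, comb(n, n1))


def formula_expected_runs(n1, n2, i):
    """Closed form assumed for (7.8)."""
    return Fraction(falling(n2 + 1, 2) * falling(n1, i), falling(n1 + n2, i + 1))


if __name__ == "__main__":
    for i in range(1, 5):
        print(i, exact_expected_runs(4, 3, i), formula_expected_runs(4, 3, i))
```

For $n_1 = 2$, $n_2 = 1$, $i = 1$ the arrangements 112, 121, 211 contain 0, 2, 0 runs of 1's of length 1, so the mean is 2/3, in agreement with the formula.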
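The variance and covariance are obtained similarly but with somewhat
more work for $\operatorname{cov}\{r_{1i}, r_{2j}\}$:

(7.10) $\operatorname{var}\{r_{1i}\} = \frac{n_1^{(2i)} (n_2+1)^{(2)} n_2^{(2)}}{(n_1+n_2)^{(2i+2)}} + \frac{(n_2+1)^{(2)} n_1^{(i)}}{(n_1+n_2)^{(i+1)}} \left[1 - \frac{(n_2+1)^{(2)} n_1^{(i)}}{(n_1+n_2)^{(i+1)}}\right],$

(7.11) $\operatorname{cov}\{r_{1i}, r_{1j}\} = \frac{n_1^{(i+j)} (n_2+1)^{(2)} n_2^{(2)}}{(n_1+n_2)^{(i+j+2)}} - \frac{n_1^{(i)} n_1^{(j)} [(n_2+1)^{(2)}]^2}{(n_1+n_2)^{(i+1)} (n_1+n_2)^{(j+1)}}, \qquad i \ne j,$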
i+2) (j+2 Pr K 
COV {rii P23} a = nf i 
(nm, + n)a (m + n) D 

(7.12) i nn? (m + (ng ++ 1) On? 


yb ny (ny + ng) O(n, F )FD 
The formulas for $\operatorname{var}\{r_{2i}\}$ and $\operatorname{cov}\{r_{2i}, r_{2j}\}$ are obtained from (7.10) and
(7.11) by interchanging the subscripts 1, 2.
By obtaining the probability expression for Cergas s hue thd 
Far ** "+ Tenn) and by an analysis similar to that above, Mood [1] 
obtains the formulas 


j (n+ 1) n® 
7.13 Jj = = 
( ) Efri,} (ny $ n) ™® 
(7.14) cov {rin ri} = na(n + U)ni'*? na(n + H?n nf? 


(my + nD (my + nAn, + ng)? 
(7.15) var {ri} = fat a mt Da [ fa a id > 
(ny + n) (m + 1) (a, + me) 
ntd +D a 2n +D +D 
(ny + n) HD) 
nE DnD + nn 
(m + DOn + 1) nn 
(my + nd) O ny FD 
n+ Dp +D a nn 
(m+n) H Gi tan 
_ + Dry + Inn? 
(my + ng) + ny) ` 


(7.16) cov {rir ra} = 


(7.17) cov {rim ran} = 


Also, we have

(7.18) $E\{r_1\} = \frac{(n_2 + 1)\, n_1}{n_1 + n_2},$


L mnn 
(7.19) var {r} = EEN ` 


a 


— 


ee Í 
We consider now the limiting distribution of the $r_{ij}$ as $n_1$ and $n_2$ approach
infinity in a fixed ratio. Let

$\frac{n_1}{n_1 + n_2} = e_1, \qquad \frac{n_2}{n_1 + n_2} = e_2,$

with $0 < e_1, e_2 < 1$. Then, of course, we have $e_1 + e_2 = 1$.


THEOREM 7.1. (MOOD). As $n_1 + n_2 \to \infty$, the limiting distribution
of any finite set of the random variables $r_{11}, r_{12}, \cdots, r_{21}, r_{22}, \cdots$ is normal
with means and variances given by the formulas above.


Note. As $n_1 + n_2$ becomes large, the relation $|r_1 - r_2| \le 1$ becomes
equivalent to the linear relation $r_1 = r_2$. The limiting normal distribution
will be degenerate if the variables chosen satisfy a linear relationship
obtained from $r_1 = r_2$.


Proof. The proof is given in Mood [1]. The method is essentially that
used to show the binomial distribution approaches the normal distribution.
The $r_{ij}$ are expressed in terms of the standardized variables


and a substitution is made in the probability expression. Stirling’s 
formula is used to replace the factorials. The logarithm of the probability 
expression is then shown to approach the logarithm of the multivariate 
normal density (uniform convergence is not necessary; see Scheffé [13]). 


Wolfowitz in [2] introduces a statistic called an additive partition
function. Let $f(x)$, $g(x)$ be real-valued functions defined for all positive
integers, and consider the function

$I(r_{11}, \cdots, r_{21}, \cdots) = \sum_{j=1}^{n_1} r_{1j} f(j) + \sum_{j=1}^{n_2} r_{2j} g(j).$

In terms of the sequence $v = (v_1, \cdots, v_{n_1+n_2})$ which produced the values
$r_{11}, r_{12}, \cdots, r_{21}, r_{22}, \cdots$, $I(r_{11}, \cdots, r_{21}, \cdots)$ can be obtained by adding a number
for each run, the number being $f(j)$ if the run is of 1's and of length $j$
and $g(j)$ if the run is of 2's and of length $j$. Thus $I(r_{11}, \cdots, r_{21}, \cdots)$ can
be expressed as an additive function of the partition of 1's and 2's into runs
and accordingly is called an additive partition function.
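The description "adding a number for each run" translates directly into code. The Python sketch below is ours and is meant only to make the definition concrete; `additive_partition` is a hypothetical name, the sequence is assumed to consist of the symbols 1 and 2, and the functions `f` and `g` are whatever the user supplies.

```python
from itertools import groupby


def additive_partition(seq, f, g):
    """Sum f(length) over runs of 1's and g(length) over runs of 2's in seq."""
    total = 0
    for symbol, run in groupby(seq):
        length = sum(1 for _ in run)
        total += f(length) if symbol == 1 else g(length)
    return total


if __name__ == "__main__":
    v = [1, 1, 2, 1, 2, 2, 2]
    # f = g = 1 counts every run, so the value is the total number of runs r.
    print(additive_partition(v, lambda j: 1, lambda j: 1))
    # f(j) = 1 for j >= 2 and g = 0 counts runs of 1's of length 2 or more.
    print(additive_partition(v, lambda j: 1 if j >= 2 else 0, lambda j: 0))
```

Taking $f(j) = j$ and $g(j) = 0$ returns $n_1$ for every arrangement, which is why functions proportional to $x$ give a degenerate statistic.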
THEOREM 7.2. (WOLFOWITZ). If $f(x)$ and $g(x)$ are not proportional to
$x$, and if the series


Soje 


i=l 
and 
k 
Shole 
i=1 


converge where $e_m = \max\{e_1, e_2\}$, then the additive partition function

$\sum_{j=1}^{n_1} r_{1j} f(j) + \sum_{j=1}^{n_2} r_{2j} g(j)$

has a limiting normal distribution as $n_1 + n_2 \to \infty$ with $n_1 = e_1(n_1 + n_2)$
and $n_2 = e_2(n_1 + n_2)$.


Proof. See Wolfowitz [2]. The basic idea in the proof is that the
additive partition function can be approximated by

(7.20) $\sum_{j=1}^{k} r_{1j} f(j) + \sum_{j=1}^{h} r_{2j} g(j)$

as $n_1 + n_2 \to \infty$, and by choosing $k$, $h$ large the stochastic approximation
can be made as close as desired. (7.20) has a limiting normal distribution
by Theorems 7.1 and 2.8.


8. PROBLEMS FOR SOLUTION 


1. Show that the moments of a normal distribution satisfy the criterion in Theorem 2.2.


2. Show that p-lim $X_n = c$ is equivalent to the convergence of the corresponding
distribution functions to $F(x)$, defined by

$F(x) = 1$ if $x \ge c$,
$F(x) = 0$ if $x < c$.


3. Show that, if $\lim_{n\to\infty} E\{X_n\} = c$ and $\lim_{n\to\infty} \operatorname{var}\{X_n\} = 0$, then p-lim $X_n = c$. Use
Tchebycheff's inequality.

4. $X_1, X_2, \cdots$ is a sequence of independent and identically distributed random
variables. Give conditions under which $\sum_{i=1}^{n} (X_i - \bar{X})^2$ has asymptotically a normal
distribution. What mean? What variance?
5. Let $X_1, X_2, \cdots$ be a sequence of independent and identically distributed random
variables. By using the obvious trivariate extension of Theorem 2.7, show that under 
suitable conditions the limiting distribution of 


n 2 
KA+ XKt + KaKa <0 (> x) 
1 


n 2 
nasat-n(S x] 


1 


pe 


is normal with mean 0 and variance 1, What conditions? This random variable is of 
interest in tests of randomness against the alternative of serial correlation. 


n 
6. What are conditions under which the conditional distribution of >, X Xia Xavi 
1 


= X;,), given the order statistic for Xi °° +, Xm approaches the normal distribution as 
n-1 


nœ? What is the limiting conditional distribution of > XXi? These two 
1 
statistics are linear combinations, respectively, of the circular and noncircular serial 
correlation coefficients, the linear function being constant, given the order statistic. 
7. By using Theorem 2.4 in Chapter 2, show that če < Um (e=1,°°% m). Če is 


defined in Section 5. 
8. Prove Theorem 5.6. For this define 


S ieo ney Res Ya Ye) = Eff "(2p ts Pep Xan 1" Xm 
Yi t's Vegi Yegen t Ying)}s 
laya = VAE AS Ze Xn s Xai Yan Yeh 
Loo = 0. 


Prove 
m, ma 


1 >. S (mmm )(™m)(m 8) EN 
(™)(*) cı /\ m — c1) \ Ca / \Ma — Cs iiis 
c,=0c,=0 
m,/\ m 
In analogy with the proof of Theorem (5.1) let 
Z, = ùU — n), 


et mani? we 
y,=™ > fat) +—— > fal ¥p)- 
ae na 
: a=l pal 
Prove that the limiting variance of Z, is 


n 
milo + m (im j Es 
2 


alued random variables which are independent and 
Chapter 4) defines a coefficient of concentration 


a4 
=F 


(8.1) var {U} = 


: 9. Let Y, +++, Yn be positive-v 
identically distributed. Gini ([13), 
n

where d is defined by (5.21) and 7 = n~! z Y:- Prove that, if E{Y*} exists and if 
= 


u = E{Y} > 0, then 
A 
172 E 
o (: a) 


has a limiting normal distribution with mean 0 and variance 
BY es Aoa ot ed 

ga i) — — 89, d) + — (a) 
4j H u? 


where ¢,(d) is given by (5.23), and 
(9) 


E(Y*} — p? 
= var {Y} 


£9, d) = ff th |V — Ye | dFQn)dF(y2) — pA. 


This requires the obvious vector extension of Theorem 5.1, 
10. Let U™, U) be two U statistics based on the symmetric kernels, respectively, 
LMG Em), fry «+ Em). Prove that the covariance of U® and U is 


given by 
mız 
E E (2 )> keai n— we (12) 
Bvt S uag Le c ma — c ge), 


c=1 


where m, = min {m,, ma} and C4") is the covariance between SFX, +++, X) and 
ft CK, © 5. Xe). , 
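11. For a sequence $(x_1^{(1)}, x_1^{(2)}), \cdots, (x_n^{(1)}, x_n^{(2)})$ let $(r_1^{(1)}, r_1^{(2)}), \cdots, (r_n^{(1)}, r_n^{(2)})$ be the
ranks as defined in Example 5.2. Show that, if all the first coordinate $x$'s are different
and all the second coordinate $x$'s are different, then the correlation coefficient between
the first and second coordinate ranks is

(8.2) $k' = \frac{12}{n^3 - n} \sum_{\alpha=1}^{n} \left(r_\alpha^{(1)} - \frac{n+1}{2}\right)\left(r_\alpha^{(2)} - \frac{n+1}{2}\right)$

$= \frac{3}{n^3 - n} \sum_{\alpha=1}^{n} \sum_{\beta=1}^{n} \sum_{\gamma=1}^{n} s(x_\alpha^{(1)} - x_\beta^{(1)})\, s(x_\alpha^{(2)} - x_\gamma^{(2)})$

$= \frac{(n-2)k + 3t}{n+1},$

where $t$ is the difference-sign covariance (5.27) and $k$ is given by

(8.3) $k = \frac{3}{n(n-1)(n-2)} \sum s(x_\alpha^{(1)} - x_\beta^{(1)})\, s(x_\alpha^{(2)} - x_\gamma^{(2)}),$

where the summation is over all $\alpha, \beta, \gamma$ which are different. $k'$ is called the rank
correlation coefficient.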
12. If $(X_1^{(1)}, X_1^{(2)}), \cdots, (X_n^{(1)}, X_n^{(2)})$ are independent and have the same continuous
distribution function F(x, x‘®), show that & defined above is an unbiased estimate of 
« Where 


k= | | (2F(2™, ©) — 12F (20, 2!) — 1] dF, x). 


Show that « is the coefficient of correlation between UW = F(X"), œ) and U®) = 
F(x, X'®). U' has been called the grade of the random variable ¥) and hence « is 
called the grade correlation coefficient. 

13. k defined in Problem 11 is a U statistic and has a kernel g(2,, 2, 2), 


Es ay 2) = 3562] = af”) sale? — af"), 


A symmetric kernel is 


13:3 


1 2 2 
E*n ta t) = = > (20 — gel) s(x — 2), 


a a eh 5, op t at) (2) gat) 
We shall say the pair (x{”, xi), (a{”, x$) is concordant if xi? — x3? and xy — xy 


are of the same sign. For computing « and var {k}, it is convenient to introduce a real 
parameter y, the probability that at least two of the three pairs which can be chosen 
from (XY, x), (X9, XE), (Xi, XJ”) are concordant. Show that the U statistic 


for estimating y when F(x) belongs to the class of continuous distribution functions has 
symmetric kernel A*(2x;, Xa, xa) Where A*(x,, %2, X3) = 1 if at least two of the three 
expressions 

eo! — xg ya - ap) a<fP;a, P=1,2,3 


are positive and h*(2,, Xe, Xa) = 0 otherwise. 
Assuming that F(x) is continuous, show that 
h* (x1, Xa, Ws) = Cya,12C2g,29Ca1,31 F C12, 12C29,29C51,13 F C12,12C29,38091,81 F Cr2,21C29,20Ca1, 91, 
where 
Caps = e(a — ap ay — x$"), 
and ¢(u) is defined by (5.25). Prove that 
E*l Xa, Xs) = 2h* (£i, Te, %) — 1, 


and hence show that 
t= 2y— 1, 
14, In the notation of Problem 13, prove that 
SHC, 2a) = 1 + 2F (a, ah”) + 2F(ed, 21") 
— ofa?” — a!) FP, 00) — Zele — af) Fe, 0) 
— efi? — a!) F(oo, wf") — 2e — zg’) FC, ws"), 
gt) = [1 — 2F(e®, JE — 2F(0, 22°01 

— 2F P(x, ©) — 2F(@, zi”) 
+4 f Fy), y!) dF(, 9) + 4 J EU, vay) dFU™, ©). 

Let f(g), f(g), to(g) stand for the variances, respectively, of gř, g2» 83 ; prove that 


blg) =1—«*. 
Prove that n!/*(k — «) has a limiting normal distribution with mean 0 and variance ¢,. 
Prove that 


6 n =3 j 
(84) var tk) = a (" EO- D + CQ}. 


If X and X are independent, prove that « = 0, £(g) = 1/9, (g) = 7/18, & = 1, 
and hence that 


n? —3 


n= n(n — 1)(n — 2) p 


15. The rank correlation coefficient k’ was defined in Problem 11. Prove that 


(n — 2)? var {k} + 6(n — 2) cov {t,k} + 9 var t 


fk} = 
(8.5) var {k’} PET 
Prove that 
6 
cov {t, k} {n — 3)8, k) + St, kD}, 
n(n — 1) 
where 


fit, k) = cov (ft (X1), g: (WD) 
talt, k) = cov DEE X), g: (Xo X,)}, 


where f *, g* refer to symmetric kernels, respectively, of £, of k. 
Prove that, if the bivariate distribution of X = (¥", X'®) is continuous and corre- 
sponds to independence of ¥") and ¥), then 


i(k) = f(t) = 4, k) = 1/9, 
talt, k) = 5/9, 


2(n + 2) 
8. = —— 
(8.6) cov {t, k} 3n(a = 1)’ 
and hence 
(8.7) var {k’} = A * 


n=1 
Prove that 7™/?(k’ — K) is asymptotically normal with mean 0. Prove that 
[nt — 7), nV*(k— x) (or nk — x))] 


has a limiting normal distribution with means 0, variances 4£,(r), 9£,(k), and covariance 
6¢,(t, k). It is interesting to note that, in the case of independence as n> ©, the 
correlation between ¢ and k approaches one, and the limiting functional relationship 
3t = 2k holds. 

16. Let X,, ++, Xn be independent real random variables, X; having the continuous 
distribution function Fo,(x;), where 0, € Q indexes the continuous distribution functions. 
The problem of randomness with downward trend is 


Hypothesis: Fo (2) = F(a) = +++ = Fy (x); 0 EQ, 
Alternative: Fo (®) <-++ < F(x); (0, +++, 0n) E Q", 
Mann [19] has suggested for this problem a test based on the number T of inequalities 
Ta < xp for x < B. Show that 


a wai Fe P) s(y— 9). 
2 a<p 


The U statistic, 
4T 


t= D 


2 > s(a — P) s(t, — x8), 


n(n — 1 gap 


1 


is the difference-sign correlation between 1, ++, n and ®, ***, a. Show that it is an 
unbiased estimate of the parameter 


aoe 
T= b 
n(n — 1) a 


a<p 
where 


Tag = s(% — P2 li Fog(2t) dFo f2) — 1). 


Prove that 7 = 0, <0 according as (0ı,***, 0n) belongs to the hypothesis, the 
alternative, n 

The random variables for the application of Hoeffding’s theorem are (1, X), san 
(n, X,), and hence, for both the hypothesis and alternative limiting distribution of f 
Theorem 5.5 is needed. Mann’s test is to reject the hypothesis if t < an where an is 
chosen to give the test size «. Prove that the test is consistent. Find a limiting form for 
the constant a,. What condition is needed to assure asymptotic normality of (t — 7)/ 
[var {r}}/2 under the alternative? 

17. Consider the sequence (x}", xy", x), ++, CP, al, al) and the n(n — 1) 
triplets of difference signs, 


s(t? — xl, (uss? — xl, (aie? — 29 


fora 4 B, œ, B = 1, +++, n, and assume that all 2's are different, all x‘*”’s are different, 
, i Pia P 
all x's are different. Prove that the regression functions for the trivariate “sign 


sample of size n(n — 1) are linear. . , 
Letitia, tia, tag be the difference-sign correlation between coordinates 1,2, SFT 
1, 3, and coordinates 2,3. Let fıs. be the partial correlation of s(vy’ — Tp 
seg’ — ax) re s(x?) — wg). 
hie — hates 3 
hes = a- 235)/2(1 ae t3,)¥2 


Similarly, if X, = (X, X”, X), Xe = (XP), XP), Xe") are i a eh 
aving the same continuous distribution over R°, then we can define P a a in 
Ti» 723 between the difference sign sX,” — Xz") (i= 1,2, 3), and also a p 
ifference-sign correlation 
Tie — T13T23 


712.3 Ce 71 me Ty) 




IE (XP, XY", XY), «++, (XP Xa XA) are independent and identically distributed 
according to a continuous trivariate distribution function, and if 73, + 1, 73, + 1, then 
prove that 

M?(ty9.3 — 712.3) 


has a limiting normal distribution with mean 0 and variance 


4 f | Ga — Tima)? 
T-a Gy 

5 (Tis — T2723)" Tas — Tisis 

+s)? £iltes) — 2 = — oe Calfa tha) 


Tia — Tete Gaa = TaTari — Tate 
13 2 itis. tog) n 2 3 12°13, 13 1 a) 


1 — 75 (1 — ris)(1 — 735) 
Use Theorem 2.7. 
18. Prove the relations (6.4) and (6.5). 
19. Prove the relations (6.25) and (6.26). 


20. Prove that the second statistic in Example 6.1 satisfies the conditions of 
Theorem 6.6. 


21. The randomized-block problem was introduced in Section 4.1, Chapter 3. We 
consider the formulation (4.2). Pitman proposed a conditional test, given the order 
statistic for each block, that is, given (Zian ***s Tirei) °° *> (ors * **s Totei). The test 
was based on the usual F statistic. Prove that, given the order statistics above, a 
statistic that is equivalent to the F statistic is 


(8.8) > Gia 
1 


b 
where Ē.; = b- 3 z; and &.. = (bc)! > xiz On the basis of Theorems 3.4 and 2.8, 
1 


i ü 


2 


altis tes) 


give conditions on the 2;, sufficient to prove, under the hypothesis of no treatment 
effects, that the condition distribution of the statistic (8.8), given the order statistics, 
has as b the limiting distribution of a constant times a z? variable on c — 1 
degrees of freedom. What constant? 

22. We consider another test for the randomized-block problem which is of correct 
size for hypothesis (4.5) or any previous hypothesis of Section (4.1) in Chapter 3. In 
each block the numbers a, «++, £; are replaced by their ranks rj, ***, re, and then 
the ordinary F statistic is calculated. Show that an equivalent statistic is 


> ( =D) 
E= > 
4 2 
b 
Fg = 6 > ry. 
i=1 


Under the hypothesis of no treatment effects find the limiting distribution of this 
statistic as b > o. 


where 
23. Another test similar in construction to the one above is to replace in each block
the numbers $x_{i1}, \cdots, x_{ic}$ by $E\{Z_{(r_{i1})}\}, \cdots, E\{Z_{(r_{ic})}\}$, where $E\{Z_{(j)}\}$ is the expected
value of the $j$th order statistic in a sample of $c$ from the normal distribution with mean
0 and variance 1. Do an analysis similar to that in the previous problem.

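24. Consider the multivariate analog of the two-sample problem:
$(X_{11}^{(1)}, \dots, X_{k1}^{(1)}), \dots, (X_{1n_1}^{(1)}, \dots, X_{kn_1}^{(1)})$ are independent, and each has the same continuous distribution
over $R^k$, and $(X_{11}^{(2)}, \dots, X_{k1}^{(2)}), \dots, (X_{1n_2}^{(2)}, \dots, X_{kn_2}^{(2)})$ are independent, and each has the
same continuous distribution over $R^k$. The hypothesis is that the two distributions
over $R^k$ are identical. For the corresponding problem, assuming normal distributions
and the same variance-covariance matrix in the two distributions, Hotelling has suggested
the test which is to reject for large values of the statistic $T^2$:

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} \sum_{i,j=1}^{k} s^{ij} (\bar{x}_i^{(1)} - \bar{x}_i^{(2)})(\bar{x}_j^{(1)} - \bar{x}_j^{(2)}),$$

where

$$\bar{x}_i^{(1)} = n_1^{-1} \sum_{\alpha=1}^{n_1} x_{i\alpha}^{(1)}, \qquad \bar{x}_i^{(2)} = n_2^{-1} \sum_{\alpha=1}^{n_2} x_{i\alpha}^{(2)},$$

and $\| s^{ij} \|$ is the inverse of the matrix $\| s_{ij} \|$, given by

$$s_{ij} = \frac{Q_{ij}}{n_1 + n_2 - 2},$$

$$Q_{ij} = \sum_{\alpha=1}^{n_1} (x_{i\alpha}^{(1)} - \bar{x}_i^{(1)})(x_{j\alpha}^{(1)} - \bar{x}_j^{(1)}) + \sum_{\alpha=1}^{n_2} (x_{i\alpha}^{(2)} - \bar{x}_i^{(2)})(x_{j\alpha}^{(2)} - \bar{x}_j^{(2)}).$$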
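Prove that

$$\frac{| Q_{ij}^* |}{| Q_{ij} |} = 1 + (n_1 + n_2 - 2)^{-1} T^2,$$

where

$$Q_{ij}^* = Q_{ij} + \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_i^{(1)} - \bar{x}_i^{(2)})(\bar{x}_j^{(1)} - \bar{x}_j^{(2)}),$$

and hence prove that $T^2$ is a monotone-increasing function of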
2 


> stig — Rog — T), 


y= 


T*? 


Where || s*i || is the inverse of 


llsé l= Ia 2% ll. 
For the nonparametric problem with alternatives involving only differences of location
of the two distributions, a test suggested by Wald and Wolfowitz is a conditional test,
given the order statistic $\{(x_{11}^{(1)}, \dots, x_{k1}^{(1)}), \dots, (x_{1n_2}^{(2)}, \dots, x_{kn_2}^{(2)})\}$, and rejects for large
values of $T^{*2}$. Prove, using mild restrictions, that under the hypothesis with probability
one the limiting conditional distribution of $T^{*2}$ is $\chi^2$ with $k - 1$ degrees of freedom.
Under what restrictions? For simplicity treat the case $k = 2$.

25. For the run theory in Section 7 prove that 


n + m — Xi + 1)a; 
na + 1)(24) 
petii m — Dia; 


Ny + Me 
ny 
26. In Section 4, Chapter 5, the likelihood-ratio method was used to produce a test
for the two-sample problem. Prove that the statistic on which the test is based has a
limiting normal distribution.
CHAPTER 7


Large-Sample Properties of Tests 


1. INTRODUCTION 


For most of the problems outlined in Chapter 3, the outcome can be 
represented as the result of a number of repetitions of a component 
experiment. It is convenient to call this number of repetitions the sample 
size. For small samples the theory of Chapter 5 can usually be applied 
directly to produce the constants determining a test’s size and to produce 
values of the power function for particular distributions of the alternative 
hypothesis. Also, for large samples the theory of limiting distributions 
developed in the foregoing chapter frequently gives approximations to 
these values. However, the accuracy of the approximation is often in 
doubt and depends on the speed with which a distribution approaches its 
limiting form. Unfortunately, for medium sample sizes, direct calcula- 
tions frequently lead to very tedious numerical work. Where these 
extensive calculations are unwarranted, a partial solution to bridging the 
gap between the results for small samples and the results for samples large 
enough for application of limiting distribution theory may in individual 
cases be obtained by experimental sampling from distributions constructed 
by the statistician. This is the Monte Carlo method. 
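
As a sketch of the experimental-sampling idea, the following Python fragment estimates by simulation the power of a one-sided sign test at a moderate sample size. The function names, the normal alternative with mean `delta` and variance 1, and the use of a nonrandomized critical count (so that the actual size may fall below `alpha`) are illustrative assumptions, not part of the text.

```python
import random
from math import comb

def sign_test_critical_count(n, alpha):
    """Smallest k with Pr{Binomial(n, 1/2) >= k} <= alpha (hypothetical helper)."""
    tail = 0.0
    for k in range(n, -1, -1):
        tail += comb(n, k) / 2 ** n      # tail is now Pr{B >= k}
        if tail > alpha:
            return k + 1                 # k itself is too liberal
    return 0

def monte_carlo_power(n, delta, alpha=0.05, reps=20000, seed=0):
    """Estimated power of the sign test against N(delta, 1), by repeated sampling."""
    rng = random.Random(seed)
    k_crit = sign_test_critical_count(n, alpha)
    rejections = 0
    for _ in range(reps):
        positives = sum(1 for _ in range(n) if rng.gauss(delta, 1.0) > 0)
        if positives >= k_crit:
            rejections += 1
    return rejections / reps
```

Calling `monte_carlo_power(20, 0.5)` for several values of `delta` traces out an approximate power curve; the precision of each estimate is governed by `reps`.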

In this chapter we consider some general theory and some particular 
results for the power of tests when the sample size is large. 


2. CONSISTENCY 


If we are to consider the effect of letting the sample size increase, we
must have a test defined for each sample size; that is, we must have a
sequence of tests. Each of the tests developed in Chapter 5 was defined
for any sample size, and hence each could be considered as a sequence of
tests. For those distributions in the alternative hypothesis which are of
particular interest, the minimum we can really expect of a sequence of
tests is that the power should approach one as the sample size increases.
This property was mentioned briefly in Chapter 2, Section 3.9, and was
called consistency. However, in nonparametric theory, the classes of
distributions are frequently of quite general mathematical form, and for
some quite useful sequences of tests it is possible to construct mathematically
peculiar distributions belonging to the alternative such that this requirement
is not satisfied. For this reason we introduce a qualified definition of
consistency.

Let $\mathcal{X}^n$ be the sample space, and $\{P_\theta^n \mid \theta \in \Omega\}$ be the class of probability
measures over $\mathcal{X}^n$. Also, for the hypothesis testing problem,

Hypothesis: $\theta \in \omega$,

Alternative: $\theta \in \Omega - \omega$,

let $\phi_n(x)$ be a test of size $\alpha$. Then

The sequence of size-$\alpha$ tests $\{\phi_n(x)\}$ is consistent for $\zeta \subset \Omega - \omega$ if

$$\lim_{n \to \infty} P_{\phi_n}(\theta) = 1 \tag{2.1}$$

for $\theta \in \zeta$.

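We now consider some criteria for consistency. Let $g(\theta)$ be a real-valued
parameter defined over $\Omega$, and suppose that this parameter distinguishes
between the hypothesis and the subclass $\zeta$ of the alternative in
the following simple manner:

$$g(\theta) = g_0 \ \text{ if } \theta \in \omega, \qquad g(\theta) > g_0 \ \text{ if } \theta \in \zeta. \tag{2.2}$$

THEOREM 2.1. (LEHMANN). If $t_n(x_1, \dots, x_n)$ is a real-valued statistic
defined over $\mathcal{X}^n$ for each $n$, and if for all $\theta \in \Omega$,

$$E_\theta\{t_n(X_1, \dots, X_n)\} = g(\theta), \tag{2.3}$$

$$\lim_{n \to \infty} \operatorname{var}_\theta\{t_n(X_1, \dots, X_n)\} = 0, \tag{2.4}$$

where the limit is uniform for $\theta \in \omega$, then the sequence of tests $\{\phi_n(x)\}$ of
exact size $\alpha$,

$$\phi_n(x) = 1 \ \text{ if } t_n(x) - g_0 > c_n, \qquad = 0 \ \text{ if } t_n(x) - g_0 \le c_n, \tag{2.5}$$

is consistent for $\zeta$.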


Proof. Since $\operatorname{var}_\theta\{t_n(X)\}$ approaches zero uniformly for $\theta \in \omega$, it is
easily seen by an application of Tchebycheff's inequality that, for any
positive $c$,

$$\Pr{}_\theta\{t_n(X) - g_0 > c\}$$

approaches zero uniformly for $\theta \in \omega$. Hence

$$\sup_{\theta \in \omega} \Pr{}_\theta\{t_n(X) - g_0 > c\}$$

approaches zero as $n$ approaches infinity. Then, since the tests are of
exact size $\alpha$, we have that

$$\limsup_{n \to \infty} c_n \le 0.$$

By a similar argument it is easily shown that, for each $\theta \in \zeta$,

$$\Pr{}_\theta\{t_n(X) - g_0 > c\}$$

approaches one for positive values of $c$ less than $g(\theta) - g_0$. It follows then that,
for $\theta \in \zeta$, the power of the test $\phi_n(x)$,

$$P_{\phi_n}(\theta) = \Pr{}_\theta\{t_n(X) - g_0 > c_n\},$$

must approach one as $n \to \infty$.


This theorem can be extended in a number of simple ways. For example,
the conclusions remain valid for two-sided tests of the form

$$\phi_n(x) = 1 \ \text{ if } |t_n(x) - g_0| > c_n, \qquad = 0 \ \text{ if } |t_n(x) - g_0| \le c_n. \tag{2.6}$$

Also conditions (2.3) and (2.4) can be replaced by convergence in probability,

$$\operatorname*{p-lim}_{n \to \infty} t_n(X_1, \dots, X_n) = g(\theta), \tag{2.7}$$

provided this convergence is uniform for $\theta$ in $\omega$; that is, provided

$$\Pr{}_\theta\{g_0 - \epsilon < t_n(X_1, \dots, X_n) < g_0 + \epsilon\} \tag{2.8}$$

converges to one uniformly for $\theta \in \omega$. There is also an immediate
analog for the two-sample problems where the limits are taken as the
smaller sample size approaches infinity.


EXAMPLE 2.1. THE TWO-SAMPLE PROBLEM. We consider the Mann-Whitney
test which was discussed in Examples 1.1 and 3.1 in Chapter 5.
This test is to reject, for large values of the statistic,

$$V = \frac{1}{n_1 n_2}\,[\text{number of pairs } (x_i, x_{n_1+j}) \text{ with } x_i < x_{n_1+j} \ \ (i = 1, \dots, n_1;\ j = 1, \dots, n_2)]. \tag{2.9}$$

By using the function $c'(u)$ defined by

$$c'(u) = 1 \ \text{ if } u > 0, \qquad = 0 \ \text{ if } u < 0, \tag{2.10}$$

we can write

$$V = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} c'(x_{n_1+j} - x_i).$$

Then, by the theory of Chapter 4 as applied in Example 2.4 of that chapter,
it is easily seen that $V$ is a minimum-variance unbiased estimate of the
parameter

$$g(F, G) = E\{c'(X_{n_1+1} - X_1)\} = \Pr\{X_{n_1+1} - X_1 > 0\},$$

where $X_1$, $X_{n_1+1}$ are independent and have the distribution functions
$F(x)$, $G(x)$.

From the minimum-variance property of $V$, it follows that the variance
of $V$ is less than the variance of the unbiased estimate

$$\frac{1}{\min(n_1, n_2)} \sum_{i=1}^{\min(n_1, n_2)} c'(X_{n_1+i} - X_i).$$

The variance of this last expression is bounded by

$$\frac{1}{\min(n_1, n_2)}.$$

Hence, the variance of $V$ approaches zero uniformly as the smaller sample
size approaches infinity. Then, by the two-sample extension of Theorem
2.1, the one-sided Mann-Whitney test is consistent against alternatives
having

$$\Pr\{X_{n_1+1} - X_1 > 0\} > 1/2,$$

and similarly the two-sided test is consistent against alternatives having

$$\Pr\{X_{n_1+1} - X_1 > 0\} \ne 1/2.$$

In this example we have assumed that the class of probability measures
corresponds to all pairs $(F(x), G(x))$ of continuous or absolutely continuous
distribution functions. The extension to include distributions having
discrete probabilities is straightforward.
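
The behavior described in this example can be checked numerically. The sketch below computes the statistic $V$ of (2.9) directly from its definition and then, under the assumption of normal samples with the second sample shifted by `shift` (an illustrative choice, as are the function names), records the mean and variance of $V$ over repeated samples so that the shrinking variance can be seen as the sample sizes grow.

```python
import random

def mann_whitney_v(xs, ys):
    """Proportion of pairs (x, y) with x < y, i.e. the statistic V of (2.9)."""
    count = sum(1 for x in xs for y in ys if x < y)
    return count / (len(xs) * len(ys))

def simulate_v(n1, n2, shift, reps=200, seed=0):
    """Mean and sample variance of V over repeated normal samples (hypothetical helper)."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        ys = [rng.gauss(shift, 1.0) for _ in range(n2)]
        values.append(mann_whitney_v(xs, ys))
    mean = sum(values) / reps
    var = sum((v - mean) ** 2 for v in values) / (reps - 1)
    return mean, var
```

Comparing `simulate_v(10, 10, 1.0)` with `simulate_v(80, 80, 1.0)` shows the mean of $V$ staying above 1/2 while its variance falls, which is the mechanism behind the consistency argument.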
3. A CRITERION FOR THE RELATIVE EFFICIENCY OF TESTS


As we remarked in the previous section, most of the standard tests are 
consistent—consistent at least for those alternatives of particular interest. 
However, if we wish to compare two sequences of tests, we could examine 
the way in which powers approach the limit one. In Section 3.9 of 
Chapter 2 we proposed an expression to measure the relative efficiency, 
and it was based on the limiting behaviors of the power. Essentially the 
expression was the limiting ratio of sample sizes such that the power 
functions were equivalent. 

In general for nonparametric tests there is no simple measure of the 
relative efficiency based on the behavior of the power function for all 
parameter values of the alternative. Often, though, the statistician is 
particularly interested in the behavior of tests for some simple parametric 
class of alternatives. For most of the problems outlined in Chapter 3, 
this parametric class is the set of normal distributions used as the alterna- 
tive for the problem as given in normal theory. For these, the power 
usually approaches one. It seems reasonable then to choose from the 
parametric class a sequence that gets closer and closer to the distribution 
of the hypothesis. Let $\{\theta_i\}$ be such a sequence. Also let $\{\phi_n\}$, $\{\phi_n^*\}$ be
two sequences of tests all of the same size $\alpha$, and let $\{n_i\}$, $\{n_i^*\}$ be two
increasing sequences of integers such that

$$\lim_{i \to \infty} P_{\phi_{n_i}}(\theta_i) = \lim_{i \to \infty} P_{\phi^*_{n_i^*}}(\theta_i), \tag{3.1}$$

with the two limits existing and not equal to zero or one (the limiting power of
$\phi_{n_i}$ at $\theta_i$ must be the same as the limiting power of $\phi^*_{n_i^*}$ at $\theta_i$). Then the
relative efficiency of $\{\phi_n\}$ with respect to $\{\phi_n^*\}$ is defined to be

$$e(\{\phi_n\}, \{\phi_n^*\}) = \lim_{i \to \infty} \frac{n_i^*}{n_i}, \tag{3.2}$$

if this limit exists and is the same for all sequences $\{n_i\}$, $\{n_i^*\}$ satisfying (3.1). This
is a definition of the relative efficiency corresponding to the sequence of
alternatives $\theta_i$ and is based on the reciprocal of the ratio of sample sizes
giving the same power for that sequence.

Under moderate assumptions the theorems in this section give a simple
expression for the relative efficiency of two sequences of tests. For this
we introduce some notation. Let $\zeta$ be a subset of the parameter space $\Omega$,
and assume that $\zeta$ is indexed by a real parameter $\delta$. Further, assume that
$\delta = 0$ gives a distribution in the hypothesis $\omega$, and that other $\delta$'s correspond
to distributions in the alternative $\Omega - \omega$. Now let $t_n(x)$, $t_n^*(x)$ be two
real-valued statistics defined over $R^n$, and designate by $E_\delta\{T_n\}$, $\sigma_\delta^2\{T_n\}$
and $E_\delta\{T_n^*\}$, $\sigma_\delta^2\{T_n^*\}$ the mean and variance of $t_n(X)$ and $t_n^*(X)$, respectively.
The theorems on efficiency will be concerned with the sequences of tests
$\{\phi_n(x)\}$, $\{\phi_n^*(x)\}$ defined by

$$\phi_n(x) = 1 \ \text{ if } t_n(x) > t_{n,\alpha}, \qquad = a_n \ \text{ if } t_n(x) = t_{n,\alpha}, \qquad = 0 \ \text{ if } t_n(x) < t_{n,\alpha}, \tag{3.3}$$

$$\phi_n^*(x) = 1 \ \text{ if } t_n^*(x) > t_{n,\alpha}^*, \qquad = a_n^* \ \text{ if } t_n^*(x) = t_{n,\alpha}^*, \qquad = 0 \ \text{ if } t_n^*(x) < t_{n,\alpha}^*, \tag{3.4}$$

where $t_{n,\alpha}$, $t_{n,\alpha}^*$ (together with $a_n$, $a_n^*$) are chosen to give the tests size $\alpha$
according to the distribution given by $\delta = 0$.

First we have two theorems on the limiting power of a sequence of
tests $\{\phi_n(x)\}$.
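THEOREM 3.1. If for $\delta = 0$,

$$\frac{d}{d\delta} E_\delta\{T_n\} > 0, \tag{3.5}$$

$$\lim_{n \to \infty} \frac{\frac{d}{d\delta} E_\delta\{T_n\}\big|_{\delta=0}}{n^{1/2}\, \sigma_0\{T_n\}} = c, \tag{3.6}$$

and, if for the sequence of alternatives $\delta_n = k/n^{1/2}$,

$$\lim_{n \to \infty} \frac{\frac{d}{d\delta} E_\delta\{T_n\}\big|_{\delta=\delta_n}}{\frac{d}{d\delta} E_\delta\{T_n\}\big|_{\delta=0}} = 1, \tag{3.7}$$

$$\lim_{n \to \infty} \frac{\sigma_{\delta_n}\{T_n\}}{\sigma_0\{T_n\}} = 1, \tag{3.8}$$

and, if corresponding to the parameter value $\delta_n = k/n^{1/2}$ (with $k \ge 0$) $T_n$ is
asymptotically normal with mean $E_{\delta_n}\{T_n\}$ and variance $\sigma_{\delta_n}^2\{T_n\}$, then the
limiting power of the size-$\alpha$ test $\phi_n(x)$ is

$$1 - \Phi(z_\alpha - kc), \tag{3.9}$$

where $1 - \Phi(z_\alpha) = \alpha$ and $\Phi(z)$ is the standardized normal distribution
function.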
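Proof. Because $T_n$ is asymptotically normal for $k = 0$, we have

$$\lim_{n \to \infty} \frac{t_{n,\alpha} - E_0\{T_n\}}{\sigma_0\{T_n\}} = z_\alpha.$$

Also, because $T_n$ is asymptotically normal for $k > 0$, we have that the limiting
power is

$$1 - \Phi(z),$$

where

$$z = \lim_{n \to \infty} \frac{t_{n,\alpha} - E_{\delta_n}\{T_n\}}{\sigma_{\delta_n}\{T_n\}}.$$

Now, since

$$E_{\delta_n}\{T_n\} = E_0\{T_n\} + \delta_n \left[ \frac{d}{d\delta} E_\delta\{T_n\} \right]_{\delta = \delta_n'}, \qquad 0 < \delta_n' < \delta_n,$$

then

$$z = \lim_{n \to \infty} \left[ \frac{t_{n,\alpha} - E_0\{T_n\}}{\sigma_0\{T_n\}} \cdot \frac{\sigma_0\{T_n\}}{\sigma_{\delta_n}\{T_n\}} - \frac{k}{n^{1/2}} \cdot \frac{\frac{d}{d\delta} E_\delta\{T_n\}\big|_{\delta=\delta_n'}}{\sigma_{\delta_n}\{T_n\}} \right] = z_\alpha - kc.$$

This completes the proof.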
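
For numerical work, the limiting power (3.9) is easy to evaluate. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the function name and the default size 0.05 are illustrative choices, and `k` and `c` are the quantities $k$ and $c$ of Theorem 3.1.

```python
from statistics import NormalDist

def limiting_power(k, c, alpha=0.05):
    """Limiting power 1 - Phi(z_alpha - k*c), where 1 - Phi(z_alpha) = alpha."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    return 1.0 - std_normal.cdf(z_alpha - k * c)
```

At `k = 0` the function returns the size `alpha`, and it increases to one as `k * c` grows, so that for a fixed sequence of alternatives a larger `c` means a more powerful test in the limit.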


As a generalization of this theorem we have 


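THEOREM 3.2. (NOETHER). If for $\delta = 0$,

$$\frac{d}{d\delta} E_\delta\{T_n\} = \cdots = \frac{d^{m-1}}{d\delta^{m-1}} E_\delta\{T_n\} = 0, \qquad \frac{d^m}{d\delta^m} E_\delta\{T_n\} > 0, \tag{3.10}$$

$$\lim_{n \to \infty} \frac{n^{-m\gamma}\, \frac{d^m}{d\delta^m} E_\delta\{T_n\}\big|_{\delta=0}}{\sigma_0\{T_n\}} = c \quad \text{for some } \gamma > 0, \tag{3.11}$$

and, if for the sequence $\delta_n = k/n^\gamma$,

$$\lim_{n \to \infty} \frac{\frac{d^m}{d\delta^m} E_\delta\{T_n\}\big|_{\delta=\delta_n}}{\frac{d^m}{d\delta^m} E_\delta\{T_n\}\big|_{\delta=0}} = 1, \tag{3.12}$$

$$\lim_{n \to \infty} \frac{\sigma_{\delta_n}\{T_n\}}{\sigma_0\{T_n\}} = 1, \tag{3.13}$$

and, if corresponding to the parameter value $\delta_n = k/n^\gamma$ for $k \ge 0$, $T_n$ is
asymptotically normal with mean $E_{\delta_n}\{T_n\}$ and variance $\sigma_{\delta_n}^2\{T_n\}$, then the
limiting power of the size-$\alpha$ test $\phi_n(x)$ is

$$1 - \Phi\left( z_\alpha - \frac{k^m c}{m!} \right). \tag{3.14}$$

Proof. The proof corresponds to that for the previous theorem.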


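The next two theorems are concerned with the relative efficiency of
sequences of tests.

THEOREM 3.3. (PITMAN). If $\{T_n\}$ and $\{T_n^*\}$ satisfy the conditions of
Theorem 3.1, then the relative efficiency of $\{\phi_n\}$ with respect to $\{\phi_n^*\}$ is

$$\lim_{n \to \infty} \left[ \frac{\frac{d}{d\delta} E_\delta\{T_n\}\big|_{\delta=0} \big/ \sigma_0\{T_n\}}{\frac{d}{d\delta} E_\delta\{T_n^*\}\big|_{\delta=0} \big/ \sigma_0\{T_n^*\}} \right]^2. \tag{3.15}$$

Proof. The two tests will have the same limiting power if

$$kc = k^* c^*.$$

The sequences of alternatives will be the same if

$$\frac{k}{n^{1/2}} = \frac{k^*}{n^{*1/2}}.$$

It follows then that

$$\frac{n^*}{n} = \left( \frac{k^*}{k} \right)^2 = \left( \frac{c}{c^*} \right)^2 = \lim_{n \to \infty} \left[ \frac{\frac{d}{d\delta} E_\delta\{T_n\}\big|_{\delta=0} \big/ \sigma_0\{T_n\}}{\frac{d}{d\delta} E_\delta\{T_n^*\}\big|_{\delta=0} \big/ \sigma_0\{T_n^*\}} \right]^2.$$

THEOREM 3.4. (NOETHER). If $\{T_n\}$ and $\{T_n^*\}$ satisfy the conditions of
Theorem 3.2, and if $\gamma = \gamma^*$, $m = m^*$, then the relative efficiency of $\{\phi_n\}$
with respect to $\{\phi_n^*\}$ is

$$\lim_{n \to \infty} \left[ \frac{\frac{d^m}{d\delta^m} E_\delta\{T_n\}\big|_{\delta=0} \big/ \sigma_0\{T_n\}}{\frac{d^m}{d\delta^m} E_\delta\{T_n^*\}\big|_{\delta=0} \big/ \sigma_0\{T_n^*\}} \right]^{1/(m\gamma)}.$$

Proof. The proof corresponds to that for Theorem 3.3.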
THEOREM 3.5. Theorems 3.3 and 3.4 remain valid as stated if the tests
are two-sided tests of the form

$$\phi_n(x) = 1 \ \text{ if } t_n(x) > t_{n,\alpha'}, \ \text{ or}$$
$$\phantom{\phi_n(x)} = 1 \ \text{ if } t_{n,\alpha''} > t_n(x),$$
$$\phantom{\phi_n(x)} = 0 \ \text{ if } t_{n,\alpha'} \ge t_n(x) \ge t_{n,\alpha''},$$

where $\alpha'$, $\alpha''$ are fixed proportions, the same for each sequence of tests.

Proof. Straightforward.


In the statement of Theorems 3.3 and 3.4 there is nothing essential in the 
requirement that the limiting distribution be normal. It could be any 
other distribution form having scale and location parameters. 


EXAMPLE 3.1. THE SINGLE-SAMPLE PROBLEM OF LOCATION. Let
$X_1, \dots, X_n$ be independent and each have the same continuous distribution
function $F(x)$. The usual form of the location problem uses the median as
location parameter and is given by

(3.16) Hypothesis: $\xi_{.5}(F_\theta) = \xi_0$,

       Alternative: $\xi_{.5}(F_\theta) > \xi_0$.

In parametric theory, when the distributions are assumed normal, the
usual test for this problem is the $t$ test. From Chapter 2 we know that it
is most powerful similar, most powerful invariant, and most stringent.
The most powerful test for the nonparametric formulation was derived in
Section 2.1 of Chapter 5. It is the sign test. We calculate now the large-sample
efficiency of the sign test with respect to the $t$ test when the underlying
distributions are normal.

Both tests are invariant under a change of scale about the origin. Hence
it suffices to evaluate efficiency when the underlying distribution being
sampled is normal with mean $\delta$ and variance 1. Also, without loss of
generality, we can let $\xi_0 = 0$.

If we use the function $c'(u)$ defined by (2.10), the sign test is to reject for
large values of the statistic

$$v_n = n^{-1} \sum_{i=1}^{n} c'(x_i). \tag{3.17}$$

We have

$$E_\delta\{V_n\} = E_\delta\{c'(X)\} = \int_0^\infty (2\pi)^{-1/2} \exp\left[ -\tfrac{1}{2}(x - \delta)^2 \right] dx = \int_{-\delta}^\infty (2\pi)^{-1/2} \exp\left( -\tfrac{1}{2} x^2 \right) dx, \tag{3.18}$$

$$\frac{d}{d\delta} E_\delta\{V_n\}\Big|_{\delta=0} = (2\pi)^{-1/2}. \tag{3.19}$$

We calculate the variance of $V_n$:

$$\sigma_\delta^2\{V_n\} = \frac{1}{n}\, \sigma_\delta^2\{c'(X)\} = \frac{1}{n}\, p(1 - p),$$

where $p$ is given by the expression (3.18). If $\delta = 0$, then $p = 1/2$ and

$$\sigma_0^2\{V_n\} = \frac{1}{4n}.$$

Also by Theorem 3.5 in Chapter 6 it is straightforward to show that the
induced distribution of the statistic $v_n$ is asymptotically normal.

The $t$ test is to reject for large values of the statistic

$$w_n = \frac{\bar{x}}{s_x},$$

where

$$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i, \qquad s_x^2 = (n - 1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

The denominator of $w_n$ converges stochastically to 1, and hence an asymptotically
equivalent statistic is

$$w_n' = \bar{x}.$$

We have

$$E_\delta\{W_n'\} = \delta, \qquad \frac{d}{d\delta} E_\delta\{W_n'\} = 1.$$

Also we have

$$\sigma_\delta^2\{W_n'\} = \frac{1}{n}.$$

By Theorem 3.5 in Chapter 6 it follows that the induced distribution of $w_n$ is
asymptotically normal.

Then by Theorem 3.3 the large-sample efficiency of the sign test with
respect to the $t$ test for normal alternatives is

$$\lim_{n \to \infty} \left[ \frac{(2\pi)^{-1/2} \big/ (1/4n)^{1/2}}{1 \big/ (1/n)^{1/2}} \right]^2 = \frac{2}{\pi}.$$

In [4] Dixon considers the power function of the sign test for small samples
and shows that, for approximate agreement of the power functions, the
ratio of sample sizes is approximately 0.95 when the sign test is based on a
sample of 5, 0.80 for a sample of 10, 0.70 for a sample of 20. This indicates
a high efficiency for small samples which gradually decreases to a limiting
efficiency of 0.637.
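
The limiting efficiency in this example can be reproduced numerically from the quantities that enter Theorems 3.3 and 3.4. The sketch below is an illustration only: the helper names are hypothetical, the derivative of the mean is approximated by a central difference rather than computed analytically, and the value `n = 100` is arbitrary because $n$ cancels in the ratio.

```python
from math import pi, sqrt
from statistics import NormalDist

def efficacy(mean_fn, sd_at_zero, n, h=1e-5):
    """Approximate [d/d(delta) E_delta{T_n} at 0] / (n^(1/2) * sigma_0{T_n})."""
    derivative = (mean_fn(h) - mean_fn(-h)) / (2.0 * h)   # central difference
    return derivative / (sqrt(n) * sd_at_zero)

def relative_efficiency(c, c_star, m=1, gamma=0.5):
    """(c / c*)^(1/(m*gamma)); m = 1, gamma = 1/2 gives the squared ratio (3.15)."""
    return (c / c_star) ** (1.0 / (m * gamma))

n = 100
std_normal = NormalDist()
# Sign test: E_delta{V_n} = Pr{X > 0} = Phi(delta), sigma_0^2{V_n} = 1/(4n).
c_sign = efficacy(std_normal.cdf, sqrt(1.0 / (4 * n)), n)
# Mean: E_delta{W'_n} = delta, sigma_0^2{W'_n} = 1/n.
c_mean = efficacy(lambda d: d, sqrt(1.0 / n), n)
sign_vs_t = relative_efficiency(c_sign, c_mean)   # approximately 2/pi
```

The same two functions apply to any pair of statistics for which the mean under the alternative and the null standard deviation are available, which is the practical content of Theorem 3.3.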


4. THE EFFICIENCY OF SOME CONDITIONAL TESTS 


In Chapter 3, Section 2.2, we developed a technique for finding similar 
tests most powerful for simple alternatives. This technique depended on 
having a statistic that was sufficient and complete under the distributions 
of the hypothesis. The resultant test could be described as a conditional 
test, given the statistic. In this section we develop some theory which 
enables us to show that a number of these conditional tests are asymptoti- 
cally as efficient, when the distributions are normal, as the corresponding 
tests of parametric theory. 

Consider the sample space $\mathcal{X}(\mathcal{A})$, the class of probability measures
$\{P_\theta \mid \theta \in \Omega\}$, and the hypothesis testing problem

(4.1) Hypothesis: $\theta \in \omega$,

      Alternative: $\theta \in \Omega - \omega$.

Suppose that $t(x)$ is a statistic which is sufficient for the probability
measures of the hypothesis, $\{P_\theta \mid \theta \in \omega\}$. Let $s(x)$ be a real-valued statistic.
Then the type of test mentioned above has the following form:

$$\phi(x) = 1 \ \text{ if } s(x) > c_{t(x)}, \qquad = a_{t(x)} \ \text{ if } s(x) = c_{t(x)}, \qquad = 0 \ \text{ if } s(x) < c_{t(x)}, \tag{4.2}$$

where the 'constants' $a_{t(x)}$, $c_{t(x)}$ are chosen so that, under the hypothesis,
the test has conditional size $\alpha$, given the statistic $t(x)$.

For convenience we introduce some additional notation to describe the 
test (4.2). From the assumptions above it follows that, under the hypothesis,
the conditional distribution, given the statistic $t(x)$, does not
depend on the parameter $\theta \in \omega$. Hence, under the hypothesis, there is a
single conditional distribution for $s(x)$, given a value for $t(x)$. This
conditional distribution is important because it is with respect to it that the
test was given size $\alpha$. Let $F(s; x)$ be the distribution function for this
conditional distribution; we have

$$F(s; x) = \Pr{}_\omega\{s(X) \le s \mid t(x)\}, \tag{4.3}$$

where the subscript $\omega$ is used to indicate the single hypothesis distribution,
given a value for $t(x)$. We introduce a symbol $S(x)$ designating a random
variable having the conditional distribution $F(s; x)$:

$$F(s; x) = \Pr\{S(x) \le s\}. \tag{4.4}$$

Also we wish a symbol to designate a certain percentage point of this
distribution. In Chapter 2, Section 2.1, we introduced the symbol
$\xi_p(Y)$ to designate the $p$ percentile of the distribution of the real random
variable $Y$. However, for a description of the test (4.2) we need a point
exceeded with probability $\alpha$; hence with some apology for the notation we
define

$$\xi_\alpha(x) = \xi_{1-\alpha}(S(x)). \tag{4.5}$$

The test (4.2) then takes the form

$$\phi(x) = 1 \ \text{ if } s(x) > \xi_\alpha(x), \qquad = a_{t(x)} \ \text{ if } s(x) = \xi_\alpha(x), \qquad = 0 \ \text{ if } s(x) < \xi_\alpha(x), \tag{4.6}$$

where $a_{t(x)}$ is chosen to give the test size $\alpha$ under the hypothesis and is
given explicitly by

$$a_{t(x)} = \frac{\alpha - \Pr\{S(x) > \xi_\alpha(x)\}}{\Pr\{S(x) = \xi_\alpha(x)\}}. \tag{4.7}$$

To illustrate this notation we refer to Example 2.1 in Chapter 5. There
the Pitman two-sample test was derived as the most powerful similar test
against alternatives involving normal distributions. The statistic $t(x)$
was the order statistic for the combined sample. A number of definitions
for $s(x)$ were considered, but at the end of the example it was taken to be
the usual two-sample $t$ statistic, and this is the most convenient form for
our purposes here. The conditional distribution, given a value for $t(x)$, is
equal probability to each of the points obtained by permuting the coordinates
in the order statistic. There are $(n_1 + n_2)!$ such permutations.
Under the hypothesis that the two samples come from the same distribution,
it is seen that the induced conditional distribution is discrete: it assigns
equal probability to each of $(n_1 + n_2)!$ values of $s(x)$ derived under
permutations of the coordinates in the order statistic (of course, not all
of these values are different).
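
For very small samples the conditional distribution just described can be enumerated exactly, and the constants $\xi_\alpha(x)$ and $a_{t(x)}$ of (4.6) and (4.7) computed from it. The sketch below makes two simplifying assumptions that are not in the text: it uses the difference of the two sample means for $s(x)$, which for fixed pooled values orders the permutations in the same way as the two-sample $t$ statistic, and it enumerates the splits of the pooled sample rather than all $(n_1 + n_2)!$ permutations, since each split corresponds to the same number of permutations. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def conditional_test_constants(xs, ys, alpha):
    """Return (xi_alpha, a) for the permutation distribution of mean(ys) - mean(xs)."""
    alpha = Fraction(alpha)
    pooled = list(xs) + list(ys)
    n1, n = len(xs), len(pooled)
    total = sum(Fraction(v) for v in pooled)
    values = []
    for first in combinations(range(n), n1):      # positions assigned to sample 1
        sum1 = sum(Fraction(pooled[i]) for i in first)
        values.append((total - sum1) / (n - n1) - sum1 / n1)
    m = len(values)

    def upper_tail(v, strict):
        hits = sum(1 for u in values if (u > v if strict else u >= v))
        return Fraction(hits, m)

    # xi_alpha: largest point with Pr{S >= xi} >= alpha, so Pr{S > xi} < alpha.
    xi = max(v for v in set(values) if upper_tail(v, False) >= alpha)
    prob_gt = upper_tail(xi, True)
    prob_eq = upper_tail(xi, False) - prob_gt
    a = (alpha - prob_gt) / prob_eq               # formula (4.7)
    return xi, a
```

With the returned pair, the test rejects outright when the observed statistic exceeds `xi` and rejects with probability `a` when it equals `xi`, which gives conditional size exactly `alpha`.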

We use the framework of this example to indicate the direction in which
we shall develop the theory in this section. The statistic $s(x)$ is the two-sample
$t$ statistic. For the two-sample problem involving normal distributions
and a one-sided alternative, the most powerful similar test is the
$t$ test and is given by

$$\phi^*(x) = 1 \ \text{ if } s(x) > s_\alpha, \qquad = 0 \ \text{ if } s(x) \le s_\alpha,$$

where $s_\alpha$ is the point exceeded with probability $\alpha$ according to the $t$
distribution with $n_1 + n_2 - 2$ degrees of freedom. We shall show for a
class of alternatives including some normal distributions that $\xi_\alpha(X)$
converges in probability to $s_\alpha$ as the sample sizes increase; also that the
limiting distribution of $s(X)$ is continuous at the limiting value of $s_\alpha$.
From this it follows quite easily that, for this class of distributions, the
tests $\phi(x)$ and $\phi^*(x)$ are asymptotically equivalent and hence have the
same limiting power function. This then almost immediately implies
that Pitman's test for normal alternatives is asymptotically as efficient as
the usual $t$ test.

We return to the general model introduced earlier. Our results in this 
section are concerned with limiting distributions and relative efficiency as 
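a parameter $n$ approaches $\infty$. Each of the symbols introduced can
depend on $n$; however, it is not convenient to put a subscript $n$ on every
symbol introduced, but we shall try to use it where it is most essential.
Hence the test (4.6) for sample size $n$ is given by

$$\phi_n(x) = 1 \ \text{ if } s_n(x) > \xi_{\alpha,n}(x), \qquad = a_{t(x)} \ \text{ if } s_n(x) = \xi_{\alpha,n}(x), \qquad = 0 \ \text{ if } s_n(x) < \xi_{\alpha,n}(x). \tag{4.8}$$

Also suppose that $\phi_n^*(x)$ is a related test of the form found in parametric
theory,

$$\phi_n^*(x) = 1 \ \text{ if } s_n(x) > s_{\alpha,n}, \qquad = a_n^* \ \text{ if } s_n(x) = s_{\alpha,n}, \qquad = 0 \ \text{ if } s_n(x) < s_{\alpha,n}, \tag{4.9}$$

where the constant $s_{\alpha,n}$ is chosen to give the test size $\alpha$ for a distribution
$\theta^0$ in the hypothesis $\omega$.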


THEOREM 4.1. (HOEFFDING). For the sequence of distributions $\{\theta_n\}$
for $X_n$, if $F_n(s; X_n)$ converges in probability to a distribution function $F(s)$
at every point of continuity of $F(s)$, if $F(s) = 1 - \alpha$ has a unique solution
$s = s_\alpha$, a point of continuity of $F(s)$, then $\xi_{\alpha,n}(X_n)$ converges in probability
to $s_\alpha$. And, if there exists a function $H(s)$ continuous at $s_\alpha$ such that

$$\Pr{}_{\theta_n}\{s_n(X_n) \le s\} \to H(s) \tag{4.10}$$

at every continuity point of $H(s)$, then, for the sequence of distributions
$\{\theta_n\}$, the power of $\phi_n(x)$ converges to $1 - H(s_\alpha)$. And, if $s_{\alpha,n}$ converges to
$s_\alpha$ then, for the sequence $\{\theta_n\}$, the power of $\phi_n^*$ converges also to $1 - H(s_\alpha)$.

Note. It is of interest to emphasize that $F_n(s; x)$ is the distribution
function corresponding to the hypothesis conditional distribution of
$s_n(X)$, given $t_n(x)$. On the other hand, $H(s)$ is the limiting form of the
marginal distribution of $s_n(X)$ under the sequence of alternative distributions
$\{\theta_n\}$.

Also the assumptions of the theorem are much stronger than necessary
for the equality of limiting powers for the sequence $\{\theta_n\}$. The proof will
indicate the modifications that can be made with the results remaining
valid.


Proof. We first show that $\xi_{\alpha,n}(X_n)$ converges in probability to $s_\alpha$.
From the definitions of $\xi_{\alpha,n}(x)$ and $F_n(s; x)$, it follows that

$$1 - \alpha < F_n(s; x)$$

implies that

$$\xi_{\alpha,n}(x) \le s,$$

which implies that

$$1 - \alpha \le F_n(s; x);$$

hence

$$\Pr{}_{\theta_n}\{F_n(s; X_n) > 1 - \alpha\} \le \Pr{}_{\theta_n}\{\xi_{\alpha,n}(X_n) \le s\} \le \Pr{}_{\theta_n}\{F_n(s; X_n) \ge 1 - \alpha\}. \tag{4.11}$$

Now, if $s$ is a continuity point of $F(s)$ and if $s < s_\alpha$, then from the assumptions
we have

$$\operatorname*{p-lim}_{n \to \infty} F_n(s; X_n) = F(s) < F(s_\alpha) = 1 - \alpha, \tag{4.12}$$

and hence the outside terms of (4.11) approach zero; hence

$$\Pr{}_{\theta_n}\{\xi_{\alpha,n}(X_n) \le s\} \to 0,$$

if $s < s_\alpha$. And, if $s$ is a continuity point of $F(s)$ and if $s > s_\alpha$, then from
the assumptions we have

$$\operatorname*{p-lim}_{n \to \infty} F_n(s; X_n) = F(s) > F(s_\alpha) = 1 - \alpha, \tag{4.13}$$

and hence the outside terms of (4.11) approach one; hence

$$\Pr{}_{\theta_n}\{\xi_{\alpha,n}(X_n) \le s\} \to 1,$$

if $s > s_\alpha$. It follows then that $\xi_{\alpha,n}(X_n)$ converges in probability to $s_\alpha$.

We now show that the power of $\phi_n(x)$ approaches $1 - H(s_\alpha)$. From
the definition of $\phi_n(x)$ in (4.8), we have

$$\Pr{}_{\theta_n}\{s_n(X_n) > \xi_{\alpha,n}(X_n)\} \le E_{\theta_n}\{\phi_n(X_n)\} \le \Pr{}_{\theta_n}\{s_n(X_n) \ge \xi_{\alpha,n}(X_n)\}. \tag{4.14}$$

From the assumptions, we have that $H(s)$ is continuous at $s_\alpha$ and that
$\Pr\{s_n(X_n) > s\}$, $\Pr\{s_n(X_n) \ge s\}$ converge to $1 - H(s)$ for continuity
points $s$. Hence $\Pr\{s_n(X_n) > s\}$, $\Pr\{s_n(X_n) \ge s\}$ can be made arbitrarily
close to $1 - H(s_\alpha)$ by choosing $n$ large enough and $s$ close enough to $s_\alpha$.
Then, since $\xi_{\alpha,n}(X_n)$ converges in probability to $s_\alpha$, it follows that the outside
terms of (4.14) converge to $1 - H(s_\alpha)$, and hence that

$$\lim_{n \to \infty} E_{\theta_n}\{\phi_n(X_n)\} = 1 - H(s_\alpha).$$

Since it is assumed that $s_{\alpha,n}$ converges to $s_\alpha$, it follows trivially that the
power of $\phi_n^*(x)$ converges also to $1 - H(s_\alpha)$. This completes the proof.
The theorem above has been stated for one-sided tests, but it extends in 
a straightforward manner to cover the two-sided tests. 
The next theorem gives a simple procedure for checking whether a
distribution function $F_n(s; X_n)$ converges in probability to a distribution
function $F(s)$.

THEOREM 4.2. (HOEFFDING). A necessary and sufficient condition
that $F_n(s; X_n)$ converge in probability to $F(s)$ is that

$$\Pr\{S_n(X_n) \le s\} \to F(s), \tag{4.15}$$

$$\Pr\{S_n(X_n) \le s,\ S_n'(X_n) \le s\} \to F^2(s), \tag{4.16}$$

where $S_n(x)$, $S_n'(x)$ are independent and identically distributed random
variables defined by (4.4).

Proof. Problem 3 in Chapter 6 was to show that, if the mean and
variance of a random variable $Y_n$ converge, respectively, to $c$ and 0, then
the random variable $Y_n$ converges in probability to $c$. An equivalent
condition is that $E\{Y_n\}$, $E\{Y_n^2\}$ converge, respectively, to $c$, $c^2$. If the
random variables are uniformly bounded, then it is trivial to show that the
converse also holds. Then, since $0 \le F_n(s; X_n) \le 1$, the convergence in
probability of $F_n(s; X_n)$ to $F(s)$ is equivalent to

$$E\{F_n(s; X_n)\} \to F(s),$$
$$E\{[F_n(s; X_n)]^2\} \to F^2(s).$$

The theorem then follows by noting that

$$F_n(s; x) = \Pr\{S_n(x) \le s\}, \tag{4.17}$$

$$F_n^2(s; x) = \Pr\{S_n(x) \le s,\ S_n'(x) \le s\}. \tag{4.18}$$
EXAMPLE 4.1. THE PROBLEM OF LOCATION, GIVEN SYMMETRY. The
problem of location, given symmetry, was described in Section 2.2 of
Chapter 3 and was designated by (2.7) and (2.8). Let $X_1, \dots, X_n$ be
independent and each have the same absolutely continuous distribution
which is symmetric about the median. The problem is to test the
hypothesis that the median is zero against the alternative that it is larger
(one-sided problem), or that it is not equal to zero (two-sided problem).

Problem 11 of Chapter 5 was to apply the theory of Section 2.2 of
that chapter to finding the most powerful similar test for the one-sided
problem against a normal alternative and a most stringent similar test for
the two-sided problem with normal alternative. The test was a conditional
test, given the statistic $t_n'(x) = \{|x_1|, \dots, |x_n|\}$, and was to reject
for large values of the statistic $s_n'(x) = \sum x_i$ for the one-sided problem and
large values of $|s_n'(x)|$ for the two-sided problem. Since both $t_n'(x)$ and
$s_n'(x)$ are symmetric in the $x$'s, it is equivalent to construct the test as a
conditional test, given $t_n(x) = (|x_1|, \dots, |x_n|)$. Under the hypothesis the
conditional distribution, given $t_n(x) = (|x_1|, \dots, |x_n|)$, is equal probability
to each of the $2^n$ values of $(\pm x_1, \dots, \pm x_n)$. If we let $G_i$ be a
random variable taking the values $+1$, $-1$ each with probability 1/2, then
the random variable $S_n'(x)$ can be described by

$$S_n'(x) = s_n'(G_1 x_1, \dots, G_n x_n). \tag{4.19}$$

We now replace the statistic $s_n'(x)$ by an equivalent statistic so chosen that
its conditional distribution under the hypothesis has mean 0 and variance 1
(unless all the $x_i$'s $= 0$):

$$s_n(x) = \sum_{i=1}^{n} x_i \left( \sum_{i=1}^{n} x_i^2 \right)^{-1/2}. \tag{4.20}$$

We designate by $\phi_n(x)$ the one-sided conditional test based on $s_n(x)$, and
for normal alternatives we compare its power with the power of the $t$ test
which is of course most powerful similar for the problem in terms of
normal distributions.

Let $Y_i = G_i X_i$ and $Y_i' = G_i' X_i$, where $G_1, \dots, G_n, G_1', \dots, G_n'$ are
independent and identically distributed with probability 1/2 at each of
$+1$, $-1$. Then $Y_i^2 = Y_i'^2 = X_i^2$, and

$$S_n(X) = n^{-1/2} \sum_{i=1}^{n} Y_i \left[ n^{-1} \sum_{j=1}^{n} X_j^2 \right]^{-1/2}, \tag{4.21}$$

$$S_n'(X) = n^{-1/2} \sum_{i=1}^{n} Y_i' \left[ n^{-1} \sum_{j=1}^{n} X_j^2 \right]^{-1/2}. \tag{4.22}$$

Now consider the case where the common distribution of the $X_i$ is
normal with mean $\mu$ and variance $\sigma^2$. By Theorem 3.1 in Chapter 6,
$n^{-1} \sum X_i^2$ converges in probability to $\sigma^2 + \mu^2$. Hence, by Theorem 2.6,
Chapter 6, $(S_n(X), S_n'(X))$ has the same limiting distribution (if any) as

$$\left( (\sigma^2 + \mu^2)^{-1/2}\, n^{-1/2} \sum_{i=1}^{n} Y_i,\ \ (\sigma^2 + \mu^2)^{-1/2}\, n^{-1/2} \sum_{i=1}^{n} Y_i' \right). \tag{4.23}$$

The vectors $(Y_1, Y_1'), \dots, (Y_n, Y_n')$ are independent and identically distributed,
and

$$E(Y_i) = E(Y_i') = 0,$$
$$E(Y_i^2) = E(Y_i'^2) = \sigma^2 + \mu^2,$$
$$E(Y_i Y_i') = 0.$$

Then, by Theorem 3.5 in Chapter 6, the random vector (4.23) has the
limiting distribution function $\Phi(s)\Phi(s')$, where $\Phi(s)$ is the normal distribution
function with mean 0 and variance 1. Then, by Theorem 2.8 in
Chapter 6, the limiting distribution of

$$(\sigma^2 + \mu^2)^{-1/2}\, n^{-1/2} \sum_{i=1}^{n} Y_i$$

has the distribution function $\Phi(s)$. Hence, by Theorem 4.2, $F_n(s; X_n)$
converges in probability to $\Phi(s)$, and, by Theorem 4.1, $\xi_{\alpha,n}(X_n)$ converges
in probability to $s_\alpha$, where $1 - \Phi(s_\alpha) = \alpha$.

By the same type of argument, we find that the limiting distribution of

$$\frac{s_n(X_n) - n^{1/2} \mu (\sigma^2 + \mu^2)^{-1/2}}{[1 + \tfrac{1}{2}(\mu/\sigma)^2]^{1/2}\, [1 + (\mu/\sigma)^2]^{-3/2}}$$

is normal with mean 0 and variance 1. Then, if $\mu/\sigma$ is positive, it follows
that

$$H(s) = \lim_{n \to \infty} \Pr\{s_n(X_n) \le s\} = 0; \tag{4.24}$$

and hence the power of the test tends to one.

Now let $\mu$, $\sigma$ depend on $n$ in such a manner that $(\mu/\sigma) n^{1/2}$ approaches a
constant $\delta$. The distribution of the $(Y_i, Y_i')$ depends on $n$, but we are able
to repeat the above argument, using the more general central-limit
Theorem 3.5 in Chapter 6. Since $E\{|X_i|^3\}\sigma^{-3} = o(n^{1/2})$, the conditions
of the theorem are satisfied, and we obtain

$$H(s) = \Phi(s - \delta).$$
The most powerful similar test of normal theory is the $t$ test, and is
based on the statistic

$$t = \frac{n^{1/2} \bar{x}}{\left[ (n - 1)^{-1} \sum (x_i - \bar{x})^2 \right]^{1/2}}. \tag{4.25}$$

But, since

$$s_n(x) = \frac{\sum x_i}{\left( \sum x_i^2 \right)^{1/2}} = \frac{n \bar{x}}{\left[ \sum (x_i - \bar{x})^2 + n \bar{x}^2 \right]^{1/2}} = \left( \frac{n}{n - 1} \right)^{1/2} \frac{t}{\left[ 1 + t^2/(n - 1) \right]^{1/2}},$$

it follows that $s_n(x)$ is a monotone-increasing function of the $t$ statistic and
hence can be used equivalently to form the normal theory test which we
designate by $\phi_n^*(x)$ in accordance with our formula (4.9).

Our argument above shows that, under the hypothesis normal distribution
with mean 0, the limiting distribution of $s_n(X)$ is normal with mean 0
and variance 1. Hence $s_{\alpha,n}$ converges to $s_\alpha$, defined by $1 - \Phi(s_\alpha) = \alpha$.
By Theorem 4.1 the tests $\phi_n(x)$ and $\phi_n^*(x)$ have the same limiting power.
It then follows that, for normal alternatives, the relative
efficiency of the nonparametric test $\phi_n(x)$ with respect to the $t$ test is one.
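
The convergence of the conditional critical point in this example can be observed by simulation. The sketch below computes the statistic (4.20) and approximates the upper-$\alpha$ point of its sign-change distribution; drawing random sign vectors instead of enumerating all $2^n$ of them is a simplification made here, and the function names and default settings are illustrative assumptions.

```python
import random
from math import sqrt

def s_stat(xs):
    """s_n(x) = sum(x_i) / (sum(x_i^2))^(1/2), the statistic (4.20)."""
    return sum(xs) / sqrt(sum(x * x for x in xs))

def sign_flip_critical_point(xs, alpha=0.05, draws=4000, seed=0):
    """Monte Carlo approximation to the upper-alpha point of s_n(G_1 x_1, ..., G_n x_n)."""
    rng = random.Random(seed)
    norm = sqrt(sum(x * x for x in xs))          # unchanged by sign changes
    sims = sorted(
        sum(x if rng.random() < 0.5 else -x for x in xs) / norm
        for _ in range(draws)
    )
    return sims[int((1.0 - alpha) * draws)]
```

For a sample of a few hundred observations from a normal distribution, the value returned for `alpha = 0.05` lies close to the normal point 1.645, whether or not the mean is zero, in line with the conclusion that $\xi_{\alpha,n}(X_n)$ converges in probability to $s_\alpha$.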


EXAMPLE 4.2. THE RANDOMIZED-BLOCK PROBLEM. The randomized-block
problem was described in Section 4.1 of Chapter 3. Let
$X = (X_1, \dots, X_n)$, where $X_i = (X_{i1}, \dots, X_{ic})$ $(i = 1, \dots, n)$ are $n$ independent
random variables with absolutely continuous distributions over
$R^c$ $(c \ge 2)$, and let the hypothesis be one of those designated by (4.1),
(4.2), (4.3), (4.4). Each of these hypotheses implies that the distribution
of each $X_i$ is invariant under the $c!$ permutations of its coordinates
$(X_{i1}, \dots, X_{ic})$. In accordance with the theory of Section 2.2 of Chapter 5
a similar size-$\alpha$ test can be constructed as a conditional size-$\alpha$ test, given the
statistic

$$t_n(x) = (\{x_{11}, \dots, x_{1c}\}, \dots, \{x_{n1}, \dots, x_{nc}\}).$$

Also it can be shown that, for a suitable class of distributions for the
hypothesis, all similar tests have this conditional form.
A test $\phi_n(x)$ proposed by Pitman [7] is a conditional test given $t_n(x)$, and
is to reject for large values of the ordinary F statistic for testing treatment
against error. This F statistic,

$$\frac{n \sum_{j=1}^{c} (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2 \big/ (c - 1)}{\left[ \sum_{i=1}^{n} \sum_{j=1}^{c} (x_{ij} - \bar{x}_{i\cdot})^2 - n \sum_{j=1}^{c} (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2 \right] \big/ [(n - 1)(c - 1)]}, \tag{4.26}$$

varies both in numerator and denominator, given $t_n(x)$. However, it can
be written in the form

$$(n - 1)\, \frac{s_n'(x)}{1 - s_n'(x)},$$

where $\bar{x}_{i\cdot} = c^{-1} \sum_{j=1}^{c} x_{ij}$, and

$$s_n'(x) = \frac{n \sum_{j=1}^{c} (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2}{\sum_{i=1}^{n} \sum_{j=1}^{c} (x_{ij} - \bar{x}_{i\cdot})^2}. \tag{4.27}$$

From this it is easily seen that $s_n'(x)$ is an equivalent statistic to use both for
the ordinary analysis of variance test and for the conditional test. Also for
the conditional test it has the added advantage that the denominator is
constant-valued under the permutations of the conditional distribution,
given $t_n(x)$. To fit in with the use of the theorems of this section, it is
convenient to make a further trivial modification and use the statistic

$$s_n(x) = \frac{\sum_{j=1}^{c} u_j^2(x)}{n^{-1} (c - 1)^{-1} \sum_{i=1}^{n} \sum_{j=1}^{c} (x_{ij} - \bar{x}_{i\cdot})^2}, \tag{4.28}$$

where $u_j(x) = n^{-1/2} \sum_{i=1}^{n} (x_{ij} - \bar{x}_{i\cdot})$ for $j = 1, \dots, c$.


We now use the theory of this section to evaluate the limiting power
under the normal distributions of the usual analysis of variance alternative.
For this we now let

(4.29)  $X_{ij} = Y_{ij} + \beta_i + \tau_j,$

where the $Y_{ij}$ are independent and identically distributed with means 0 and
variances $\sigma^2$, and the $\beta_i$ and the $\tau_j$ are the block and the treatment effect
constants. Also we take c fixed and let $n \to \infty$.

Consider the denominator of $s_n(x)$. The random variables

$(c-1)^{-1}\sum_{j=1}^{c}(X_{ij} - \bar{X}_{i\cdot})^2 \quad (i = 1, \dots, n)$

are independent and identically distributed with mean $\sigma^2(1 + \delta^2)$, where

$\delta^2 = \sigma^{-2}(c-1)^{-1}\sum_{j=1}^{c}(\tau_j - \bar{\tau})^2,$

and

$\bar{\tau} = c^{-1}\sum_{j=1}^{c}\tau_j.$

By Theorem 3.1 in Chapter 5, it follows that

$n^{-1}(c-1)^{-1}\sum_{i=1}^{n}\sum_{j=1}^{c}(X_{ij} - \bar{X}_{i\cdot})^2$

converges in probability to $\sigma^2(1 + \delta^2)$. Also it is invariant under the
conditional distribution permutations.
Now to apply Theorem 4.2 we need the limiting distribution of
$(S_n(X), S'_n(X))$,
where
$S_n(X) = s_n(GX),$
$S'_n(X) = s_n(G'X),$
and G, G' are independent and identically distributed random variables and
are such that G applies to x with equal probability each of the $(c!)^n$
permutations of the hypothesis conditional distribution, given $t_n(x)$. By
the above paragraph it follows that
$(S_n(X), S'_n(X))$
has the same limiting distribution (if any) as does
$(s^*_n(GX), s^*_n(G'X))$,
where
$s^*_n(x) = \sigma^{-2}(1 + \delta^2)^{-1}\sum_{j=1}^{c} u_j^2(x).$


We can write

$u_j(x) = \sum_{j'=1}^{c}\left(\delta_{jj'} - \dfrac{1}{c}\right) v_{j'}(x),$

where $\delta_{jj'}$ is Kronecker's delta and

$v_j(x) = n^{-1/2}\sum_{i=1}^{n}(x_{ij} - \beta_i).$

Let $V_j = v_j(GX)$, $V'_j = v_j(G'X)$. Toward finding the limiting distribution
of $(s^*_n(GX), s^*_n(G'X))$, we now investigate the limiting distribution of
$V_1, \dots, V_c, V'_1, \dots, V'_c$. The random vector $n^{1/2}V = n^{1/2}(V_1, \dots, V_c,
V'_1, \dots, V'_c)$ is the sum of n independent random vectors, each of which has
the distribution of

$Z = (Z_{R_1}, \dots, Z_{R_c}, Z_{R'_1}, \dots, Z_{R'_c}),$

where $Z_1, \dots, Z_c$ are independent, $Z_j$ has the distribution of $Y_{ij} + \tau_j$, and
$(R_1, \dots, R_c)$ and $(R'_1, \dots, R'_c)$ are two independent random variables whose
values are the c! permutations of $(1, \dots, c)$, each taken with the same
probability. By the central-limit Theorem 3.3 in Chapter 6, it follows that
$V - E(V)$ has a limiting normal distribution with means 0 and the same
covariance matrix as Z. If $\delta^2$ and $\sigma^2$ are allowed to depend on n, then the
more general Theorem 3.5 in Chapter 6 is needed. Then by Theorem 2.8
in Chapter 6 it is straightforward to show that the limiting distribution of
$(s^*_n(GX), s^*_n(G'X))$ is that of two independent $\chi^2$ random variables with
c − 1 degrees of freedom. (See Problem 6.) Now, applying Theorems
4.1 and 4.2, we find that $\xi_{n\alpha}(X)$ converges in probability to $s_\alpha$, the point
exceeded with probability α by a $\chi^2$ random variable with c − 1 degrees
of freedom.

The results in the paragraph above also remain valid when $\delta = n^{-1/2}k$
and k is independent of n. The limiting distribution of $s^*_n(X)$ for this
sequence of alternatives can be obtained in the manner used above and is a
noncentral $\chi^2$. The usual test of normal theory has the general form
given by (4.9). Therefore, when the alternative distribution is normal as
defined at the beginning of this example, Theorem 4.1 proves that Pitman's
conditional test has the same limiting power function as does the ordinary
F test, and hence the relative efficiency is one.
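
The conditional test of this example can be sketched in a few lines. The code below is an illustration under stated assumptions, not Pitman's own procedure: it takes $s_n(x)$ to be the sum of the $u_j^2(x)$ divided by $n^{-1}(c-1)^{-1}\sum\sum (x_{ij} - \bar{x}_{i\cdot})^2$, and it approximates the conditional distribution by sampling random within-block permutations instead of enumerating all $(c!)^n$ of them. The names `s_stat`, `f_stat`, and `pitman_block_test` and the simulated data are hypothetical.

```python
import random

def block_residuals(blocks):
    """Subtract each block mean: the values x_ij - xbar_i."""
    return [[v - sum(row) / len(row) for v in row] for row in blocks]

def s_stat(blocks):
    """Assumed form of (4.28): sum of u_j^2 over the within-block mean square."""
    n, c = len(blocks), len(blocks[0])
    resid = block_residuals(blocks)
    sum_u_sq = sum(sum(resid[i][j] for i in range(n)) ** 2 for j in range(c)) / n
    mean_sq = sum(v * v for row in resid for v in row) / (n * (c - 1))
    return sum_u_sq / mean_sq

def f_stat(blocks):
    """Ordinary two-way analysis of variance F statistic, treatment against error."""
    n, c = len(blocks), len(blocks[0])
    grand = sum(sum(row) for row in blocks) / (n * c)
    row_mean = [sum(row) / c for row in blocks]
    col_mean = [sum(blocks[i][j] for i in range(n)) / n for j in range(c)]
    treat_ss = n * sum((m - grand) ** 2 for m in col_mean)
    error_ss = sum((blocks[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
                   for i in range(n) for j in range(c))
    return (treat_ss / (c - 1)) / (error_ss / ((n - 1) * (c - 1)))

def pitman_block_test(blocks, n_perm=2000, seed=0):
    """Monte Carlo version of the conditional test: permute within each block."""
    rng = random.Random(seed)
    observed = s_stat(blocks)
    hits = 0
    for _ in range(n_perm):
        permuted = [rng.sample(row, len(row)) for row in blocks]
        if s_stat(permuted) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Simulated data: 10 blocks, 3 treatments with effects 0, 2, 4 (illustrative values).
rng = random.Random(1)
effects = [0.0, 2.0, 4.0]
data = []
for _ in range(10):
    block_effect = rng.gauss(0, 3)
    data.append([block_effect + tau + rng.gauss(0, 1) for tau in effects])
observed, p_value = pitman_block_test(data)
print(observed, p_value)
```

Only the numerator of `s_stat` changes from one permutation to the next, which is the practical advantage of working with $s_n(x)$ rather than with F directly.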


EXAMPLE 4.3. THE PROBLEM OF RANDOMNESS WITH REGRESSION
ALTERNATIVE. THE TWO-SAMPLE PROBLEM. The problem of randomness
was described in Section 3.3 of Chapter 3. In this section we consider a
conditional test designed for the problem of randomness with regression
alternative. The test is an analog of Pitman's two-sample test, Example
2.1 in Chapter 5, and was mentioned in Problem 10 of that chapter.

Let $X_1, \dots, X_n$ be n real-valued random variables defined by

(4.30)  $X_i = \xi c_i + Y_i,$

where $Y_1, \dots, Y_n$ are independent and each has the same absolutely
continuous distribution. The n constants $c_i$ are the values of the
independent regression variable. The hypothesis that $\xi = 0$ is to be tested
against the one-sided alternative $\xi > 0$ or the two-sided alternative $\xi \ne 0$.

Let the statistic $t_n(x)$ be the order statistic for the set of n numbers
$x_1, \dots, x_n$:

$t_n(x) = (x_{(1)}, \dots, x_{(n)}).$

Under the hypothesis the conditional distribution, given $t_n(x)$, assigns equal
probability to each of the n! permutations of the numbers in $t_n(x)$. The
Pitman test is a conditional test, given $t_n(x)$, and is to reject for large values
of the statistic $s_n(x)$ for the one-sided alternative and for large values of
$|s_n(x)|$ for two-sided alternatives, where

(4.31)  $s_n(x) = \dfrac{\sum (c_i - \bar{c})\,x_i}{\left[\sum (c_i - \bar{c})^2\right]^{1/2}\left[(n-1)^{-1}\sum (x_i - \bar{x})^2\right]^{1/2}},$

and $\bar{c} = n^{-1}\sum c_i$, $\bar{x} = n^{-1}\sum x_i$. The one-sided test then has the form

$\phi_n(x) = 1$       if $s_n(x) > \xi_{n\alpha}(x)$
$\phantom{\phi_n(x)} = a_n(x)$  if $s_n(x) = \xi_{n\alpha}(x)$
$\phantom{\phi_n(x)} = 0$       if $s_n(x) < \xi_{n\alpha}(x)$,

and $\xi_{n\alpha}(x)$ is chosen to give the test size α according to the hypothesis
conditional distribution of $s_n(x)$, given $t_n(x)$.

The usual t test for the analog of this problem in normal theory can also
be expressed in terms of $s_n(x)$. For we can write the t statistic as follows:

$t = \dfrac{\sum (c_i - \bar{c})\,x_i \big/ \left[\sum (c_i - \bar{c})^2\right]^{1/2}}{\left\{\left[\sum (x_i - \bar{x})^2 - \dfrac{\left[\sum (c_i - \bar{c})\,x_i\right]^2}{\sum (c_i - \bar{c})^2}\right] \big/ (n-2)\right\}^{1/2}}
 = \dfrac{(n-2)^{1/2}\,s_n(x)}{\left[n - 1 - s_n^2(x)\right]^{1/2}}.$

Then, because $-(n-1)^{1/2} < s_n(x) < (n-1)^{1/2}$, this expression for the
t statistic is a monotone-increasing function of $s_n(x)$. Also it follows that
the absolute value of the t statistic is an increasing function of $|s_n(x)|$.

Hence the one-sided test of normal theory can be written as the test function
that takes

(4.32)  the value 1 if $s_n(x) > s_{n\alpha}$, and the value 0 if $s_n(x) \le s_{n\alpha}$,

where $s_{n\alpha}$ is chosen to give the test size α under the hypothesis using normal
distributions.

We consider the limiting value (if any) for $\xi_{n\alpha}(X)$ under sampling from a
hypothesis distribution. Let $\xi = 0$, var$\{Y_i\} > 0$, and $E|Y_i|^3 < \infty$.
Then by Theorem 6.8 in Chapter 6 it follows that, if either

(4.33)  $\dfrac{\max_i (c_i - \bar{c})^2}{\sum (c_i - \bar{c})^2} \to 0$

or $Y_i$ is normally distributed, then

(4.34)  $F_n(s; X) \to \Phi(s)$ in probability

as $n \to \infty$, where $\Phi(s)$ is the cumulative for the normal distribution with
mean 0 and variance 1. From this it follows that, under the above
assumptions,

(4.35)  $\operatorname{p-lim}_{n \to \infty} \xi_{n\alpha}(X) = z_\alpha,$

where $z_\alpha$ is the point exceeded with probability α according to the
standardized normal distribution. It also follows immediately that $s_{n\alpha} \to z_\alpha$ as n
approaches infinity.

We consider the limiting values of $\xi_{n\alpha}(X)$ (if any) under distributions of
the alternative. Assume that $X_i = d_i + Y_i$ where $Y_1, \dots, Y_n$ are
independent and identically distributed, $E\{|Y_i|^3\} < \infty$, var$\{Y_i\} > 0$. By
Theorem 6.9 in Chapter 6 it follows that, if

(a) either
(4.36)  $Y_i$ is normal
or
(4.37)  $\dfrac{\max_i (c_i - \bar{c})^2}{\sum (c_i - \bar{c})^2} \to 0,$
and
() f 

a Ae ' Ave 

(4.38) max (c; — ĉ) , max (d; — d) et 


Xe; -— 8? Xd; — d} 
then, as $n \to \infty$,

$F_n(s; X) \to \Phi(s)$ in probability;

that is, the probability approaches one that the distribution function of $S_n(x)$
is within any preassigned amount of the normal distribution with mean 0
and variance 1. $S_n(x)$ was defined in general notation by formula (4.4).
From this it follows by Theorem 4.1 that, under the assumptions above,

(4.39)  $\operatorname{p-lim}_{n \to \infty} \xi_{n\alpha}(X) = z_\alpha.$

We now consider the limiting distribution of $s_n(X)$ under the distributions
of the alternative mentioned above. If $E\{|Y_i|^3\} < \infty$, var$\{Y_i\} > 0$ and
either $X_i$ is normal or

$\dfrac{\max_i (c_i - \bar{c})^2}{\sum (c_i - \bar{c})^2} \to 0,$

then it follows easily from the central-limit Theorem 3.4 in Chapter 6 that
the limiting distribution of $s_n(X)$ is normal. The limiting power of the
conditional test can then be obtained by using Theorem 4.1.

We now compare the t test with Pitman's test when the distributions are
normal. Let $X_i = d_i + Y_i$ where $Y_1, \dots, Y_n$ are independent, and each
has the same normal distribution. It follows by Theorem 4.1 and the
results above that, if the power of the t test tends to a finite limit, then
the power of Pitman's conditional test converges to the same limit. Hence
the relative efficiency of the two tests is one.
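
A small numerical sketch may help fix the ideas of this example. It assumes that the statistic (4.31) is the sample correlation between the $c_i$ and the $x_i$ multiplied by $(n-1)^{1/2}$, which is the form consistent with the bounds $\pm(n-1)^{1/2}$ quoted above, and it samples random permutations rather than enumerating all n! of them. The function names and the simulated data are illustrative only.

```python
import math
import random

def s_stat(c, x):
    """Assumed form of (4.31): (n-1)^(1/2) times the correlation of c and x."""
    n = len(x)
    c_bar, x_bar = sum(c) / n, sum(x) / n
    num = sum((ci - c_bar) * xi for ci, xi in zip(c, x))
    scc = sum((ci - c_bar) ** 2 for ci in c)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return num / math.sqrt(scc * sxx / (n - 1))

def t_from_s(s, n):
    """The t statistic as a monotone-increasing function of s_n."""
    return math.sqrt(n - 2) * s / math.sqrt(n - 1 - s * s)

def t_stat(c, x):
    """Least-squares t statistic for the slope, computed directly."""
    n = len(x)
    c_bar, x_bar = sum(c) / n, sum(x) / n
    scx = sum((ci - c_bar) * (xi - x_bar) for ci, xi in zip(c, x))
    scc = sum((ci - c_bar) ** 2 for ci in c)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid_ms = (sxx - scx * scx / scc) / (n - 2)
    return (scx / math.sqrt(scc)) / math.sqrt(resid_ms)

def pitman_regression_test(c, x, n_perm=2000, seed=0):
    """One-sided Monte Carlo conditional test: permute the x values against c."""
    rng = random.Random(seed)
    observed = s_stat(c, x)
    hits = 0
    for _ in range(n_perm):
        if s_stat(c, rng.sample(x, len(x))) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Simulated regression data with slope 0.5 (illustrative values).
rng = random.Random(2)
c = [float(i) for i in range(1, 16)]
x = [0.5 * ci + rng.gauss(0, 1) for ci in c]
observed, p_value = pitman_regression_test(c, x)
print(observed, t_from_s(observed, len(x)), t_stat(c, x), p_value)
```

Taking the $c_i$ equal to 0 for one sample and 1 for the other gives the two-sample case mentioned in the title of the example.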


5. THE EFFICIENCY OF A RANK TEST 


In Chapter 5, Example 3.2, the invariance method was applied to the
two-sample problem. From the invariant or rank tests a test was chosen
having locally maximum power against normal alternatives; it was the $c_1$
test developed by Terry. In this section we prove that the $c_1$ test is
asymptotically as powerful as the best test, the t test of normal theory.
Following Hoeffding [8], however, we prove the stronger result that the
analog of the $c_1$ test for the problem of randomness with regression
alternative is asymptotically as efficient as the t test for this regression
alternative. This more general $c_1$ test was introduced in Problem 16 of
Chapter 5.

Let $X_1, \dots, X_n$ be independent, and let $X_i$ be normally distributed with
mean $\xi c_i + \eta$ and variance $\sigma^2$. For the analogous nonparametric
formulation see Section 3.3 in Chapter 3. The hypothesis testing problem is to test
the hypothesis, $\xi = 0$, against the alternative, $\xi > 0$, or the alternative,
$\xi \ne 0$. When σ is known, the standard test of parametric theory is based
on the statistic

(5.1)  $t_n(x) = \dfrac{\sum (c_i - \bar{c})\,x_i}{\sigma\left[\sum (c_i - \bar{c})^2\right]^{1/2}},$

where $\bar{c} = n^{-1}\sum_{i=1}^{n} c_i$. It is easily seen that the induced distribution of this
statistic is normal with mean

(5.2)  $\delta_n = \dfrac{\xi}{\sigma}\left[\sum (c_i - \bar{c})^2\right]^{1/2}$

and variance 1. When σ is unknown, the standard test is based on $t_n(x)$
with σ replaced by its unbiased estimate based on the $x_i$'s. By the results
in Example 4.3, this modification of $t_n(x)$ has an induced distribution
which is asymptotically normal with mean (5.2) and variance 1. The $c_1$
test of Problem 16 in Chapter 5 is based on the statistic

(5.3)  $c_1(r) = \dfrac{\sum (c_i - \bar{c})\,E\{Z_{(r_i)}\}}{\left[\sum (c_i - \bar{c})^2\right]^{1/2}},$

where $r = r(x) = (r_1, \dots, r_n)$ is the rank statistic, giving the ranks of
$x_1, \dots, x_n$, and where $Z_{(1)}, \dots, Z_{(n)}$ are the order-statistic random variables
for a sample of n from the standardized normal distribution. In the
remainder of this section we shall prove that, if $\delta_n$ is bounded, then $c_1(r)$
also has an induced distribution which is asymptotically normal with mean
$\delta_n$ and variance 1. This equivalence of the limiting distributions of the
two statistics implies that the $c_1$ test is asymptotically as efficient as the
usual t test.
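
To make the rank statistic concrete, the sketch below computes the expected values of the standardized normal order statistics by numerical integration and then evaluates a statistic of the form $\sum (c_i - \bar{c})E\{Z_{(r_i)}\}/[\sum (c_i - \bar{c})^2]^{1/2}$. It is an illustration only: the integration grid, the function names, and the sample values are our assumptions, and ties among the $x_i$ are assumed absent, as they are with probability one for absolutely continuous distributions.

```python
import math
from statistics import NormalDist

def expected_normal_order_stats(n, half_width=8.0, steps=4000):
    """E{Z_(k)}, k = 1..n, by trapezoidal integration of z times the density of Z_(k)."""
    nd = NormalDist()
    h = 2 * half_width / steps
    grid = [-half_width + i * h for i in range(steps + 1)]
    pdf = [nd.pdf(z) for z in grid]
    lower = [nd.cdf(z) for z in grid]    # Phi(z)
    upper = [nd.cdf(-z) for z in grid]   # 1 - Phi(z), computed without cancellation
    values = []
    for k in range(1, n + 1):
        coef = k * math.comb(n, k)       # n! / ((k-1)! (n-k)!)
        total = 0.0
        for i, z in enumerate(grid):
            weight = 0.5 if i in (0, steps) else 1.0
            total += weight * z * pdf[i] * lower[i] ** (k - 1) * upper[i] ** (n - k)
        values.append(coef * total * h)
    return values

def ranks(x):
    """Ranks 1..n of the observations (no ties assumed)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0] * len(x)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r

def c1_stat(c, x):
    """Rank statistic built from expected normal order statistics."""
    n = len(x)
    scores = expected_normal_order_stats(n)
    c_bar = sum(c) / n
    num = sum((ci - c_bar) * scores[ri - 1] for ci, ri in zip(c, ranks(x)))
    return num / math.sqrt(sum((ci - c_bar) ** 2 for ci in c))

scores5 = expected_normal_order_stats(5)
c = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [0.1, 0.4, 0.5, 0.9, 1.3]
print(scores5, c1_stat(c, x))
```

The statistic depends on the observations only through their ranks, so it is unchanged by any increasing transformation of the $x_i$.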

By observing the form of (5.3) we see that we can assume that $\bar{c} = 0$
and $\sum c_i^2 = 1$. Also without loss of generality we assume that $\eta = 0$ and
$\sigma^2 = 1$. Now, applying Theorem 3.3 in Chapter 5 in the same manner used
in Example 3.2 of that chapter, we find

(5.4)  $\Pr\{r(X) = r\} = \dfrac{1}{n!}\,E\left\{\dfrac{\exp\left[-\tfrac{1}{2}\sum (Z_{(r_i)} - \xi c_i)^2\right]}{\exp\left[-\tfrac{1}{2}\sum Z_{(r_i)}^2\right]}\right\}$

$\qquad = \dfrac{1}{n!}\exp\left(-\tfrac{1}{2}\xi^2\sum c_i^2\right) E\left\{\exp\left[\xi\sum c_i Z_{(r_i)}\right]\right\}$

$\qquad = \dfrac{1}{n!}\exp\left(-\dfrac{\delta_n^2}{2}\right) E\left\{\exp\left[\delta_n\sum c_i Z_{(r_i)}\right]\right\}.$

Also, if $\phi_n(t, \delta_n)$ designates the characteristic function of $c_1(r(X)) - \delta_n$,
then we have

(5.5)  $\phi_n(t, \delta_n) = \sum_r \exp\left[itc_1(r) - it\delta_n\right]\Pr\{r(X) = r\}.$

To prove that, when $\delta_n$ is bounded, $c_1(r(X)) - \delta_n$ has a limiting normal
distribution with mean 0 and variance 1, we shall prove that, for every $t \in R^1$,

(5.6)  $\lim_{n \to \infty} \phi_n(t, d) = \exp(-t^2/2),$

and that the convergence is uniform for d bounded. Then, since
$\exp(-t^2/2)$ is the characteristic function for the standardized normal, the
limiting normality of $c_1(r(X))$ follows from the use of Theorem 6.6 in
Chapter 6.




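From (5.4) and (5.5) it follows that we can write

(5.7)  $\phi_n(t, d) = \dfrac{1}{n!}\sum_r \exp\left[itc_1(r) - itd - d^2/2\right] E\left\{\exp\left[d\sum c_i Z_{(r_i)}\right]\right\}$

$\qquad = \exp(-itd - d^2/2)\,\dfrac{1}{n!}\sum_r E\left\{\exp\left[itc_1(r) + d\sum c_i Z_{(r_i)}\right]\right\}$

$\qquad = \exp(-itd - d^2/2)\,\dfrac{1}{n!}\sum_r E\left\{\exp\left[(it + d)\sum c_i Z_{(r_i)}\right]\exp\left[-it\left(\sum c_i Z_{(r_i)} - c_1(r)\right)\right]\right\}$

$\qquad = \exp(-itd - d^2/2)\,E\left\{\exp\left[(it + d)\sum c_i Z_{(R_i)}\right]\exp(-itU_R)\right\},$

where

$U_R = \sum_{i=1}^{n} c_i Z_{(R_i)} - c_1(R) = \sum_{i=1}^{n} c_i\left(Z_{(R_i)} - E\{Z_{(r_i)}\}\big|_{r = R}\right),$

and where $R = (R_1, \dots, R_n)$ designates a random variable which takes
each permutation of $(1, \dots, n)$ with the same probability 1/n!. From the
definition of R it follows that $(Z_{(R_1)}, \dots, Z_{(R_n)})$ is a random sample of n
from the standardized normal; hence $\sum_{i=1}^{n} c_i Z_{(R_i)}$ has the standardized
normal distribution. Now, letting θ stand for a complex-valued quantity
with absolute value less than one, we can rewrite (5.7)

(5.8)  $\phi_n(t, d) = \exp(-itd - d^2/2)\,E\left\{\exp\left[(it + d)\sum c_i Z_{(R_i)}\right](1 + \theta\,|tU_R|)\right\}$

$\qquad = \exp(-itd - d^2/2)\exp\left[(it + d)^2/2\right] + \exp(-itd - d^2/2)\,E\left\{\exp\left[(it + d)\sum c_i Z_{(R_i)}\right]\theta\,|tU_R|\right\}$

$\qquad = \exp(-t^2/2) + \theta\,|t|\exp(-d^2/2)\,E\left\{|U_R|\exp\left[d\sum c_i Z_{(R_i)}\right]\right\}.$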


By Schwarz's inequality we obtain

$E\left\{|U_R|\exp\left[d\sum c_i Z_{(R_i)}\right]\right\} \le \left[E\{U_R^2\}\right]^{1/2}\left[E\left\{\exp\left[2d\sum c_i Z_{(R_i)}\right]\right\}\right]^{1/2}$

$\qquad = \left[E\{U_R^2\}\right]^{1/2}\left[\exp(4d^2/2)\right]^{1/2}$

$\qquad = \left[E\{U_R^2\}\right]^{1/2}\exp(d^2).$

Therefore

(5.9)  $\left|\phi_n(t, d) - \exp(-t^2/2)\right| \le |t|\exp(d^2/2)\left[E\{U_R^2\}\right]^{1/2}.$

The convergence and uniform convergence of (5.6) are now obtained by
showing that $\lim E\{U_R^2\} = 0$. We have $E\{[\sum c_i Z_{(R_i)}]^2\} = 1$ and, given
$R = r$, $E\{\sum c_i Z_{(r_i)}\} = \sum c_i E\{Z_{(r_i)}\} = c_1(r)$; therefore

(5.10)  $E\{U_R^2\} = E\left\{\left[\sum c_i Z_{(R_i)} - c_1(R)\right]^2\right\}$

$\qquad = 1 - 2E\left\{\sum c_i Z_{(R_i)}\,c_1(R)\right\} + E\{(c_1(R))^2\}$

$\qquad = 1 - E\{(c_1(R))^2\}.$

Then, by formula (6.5) in Chapter 6,

(5.11)  $E\{[c_1(R)]^2\} = (n - 1)^{-1}\sum_{i=1}^{n}\left(E\{Z_{(i)}\}\right)^2.$

This last expression stands for the variance of the $c_1$ statistic as calculated
under the hypothesis of randomness. This variance approaches the
limiting value one, as was proved by Terry [9]. It also obtains from a
general theorem by Hoeffding which we quote at the end of this section.
It follows then that (5.11) approaches one, (5.10) approaches zero, and the
limiting normality follows from (5.9) and the succeeding remarks.

We complete this section by quoting a theorem proved by Hoeffding [10].
Let $X_1, X_2, \dots$ be independent and each have the same distribution
function F(x). Also let $(x_{(1)}, \dots, x_{(n)})$ be the order statistic for the first
n x's. Then the theorem is concerned with the distribution obtained when
a value is chosen at random (equal probability) from the n numbers in the
set $\{E(X_{(1)}), \dots, E(X_{(n)})\}$. As n becomes large, this distribution
approximates that given by F(x). We quote the theorem:


THEOREM 5.1. (HOEFFDING). If $\int |x|\,dF(x) < \infty$, and if g(x) is a
real-valued continuous function bounded by h(x), where h(x) is convex
and $\int h(x)\,dF(x) < \infty$, then

(5.12)  $\lim_{n \to \infty} \dfrac{1}{n}\sum_{i=1}^{n} g\left(E\{X_{(i)}\}\right) = \int g(x)\,dF(x).$

By taking $g(x) = x^r$ and applying the theorem to the order-statistic
random variable $(Z_{(1)}, \dots, Z_{(n)})$ for a sample of n from the standardized
normal distribution, we obtain that

$n^{-1}\sum_{i=1}^{n}\left[E\{Z_{(i)}\}\right]^r$

converges to the rth moment of the standardized normal distribution.
With r = 2 this proves that (5.11) converges to one.
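
The convergence of (5.11) to one is slow, and a numerical check makes this visible. The sketch below evaluates $(n-1)^{-1}\sum (E\{Z_{(i)}\})^2$ for a few sample sizes, computing the expected order statistics by trapezoidal integration. The grid settings and function names are assumptions made for illustration, and the check is no substitute for the theorem.

```python
import math
from statistics import NormalDist

def expected_normal_order_stats(n, half_width=8.0, steps=4000):
    """E{Z_(k)}, k = 1..n, by trapezoidal integration of z times the density of Z_(k)."""
    nd = NormalDist()
    h = 2 * half_width / steps
    grid = [-half_width + i * h for i in range(steps + 1)]
    pdf = [nd.pdf(z) for z in grid]
    lower = [nd.cdf(z) for z in grid]    # Phi(z)
    upper = [nd.cdf(-z) for z in grid]   # 1 - Phi(z)
    values = []
    for k in range(1, n + 1):
        coef = k * math.comb(n, k)
        total = 0.0
        for i, z in enumerate(grid):
            weight = 0.5 if i in (0, steps) else 1.0
            total += weight * z * pdf[i] * lower[i] ** (k - 1) * upper[i] ** (n - k)
        values.append(coef * total * h)
    return values

def c1_null_variance(n):
    """The quantity (5.11): (n-1)^(-1) times the sum of squared expected order statistics."""
    scores = expected_normal_order_stats(n)
    return sum(a * a for a in scores) / (n - 1)

variances = {n: c1_null_variance(n) for n in (5, 20, 80)}
for n, v in variances.items():
    print(n, round(v, 4))
```

The computed values increase with n and stay below one, in agreement with the limit asserted for (5.11).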


6. PROBLEMS FOR SOLUTION 


1. Show that the one-sided sign test, Section 2.1 in Chapter 5, is consistent against
alternatives for which the p percentile is positive. Also show that the two-sided sign test
is consistent against alternatives for which the p percentile is not equal to zero.

2. For the two-sample scale problem (3.6) in Chapter 3, show that the unbiased test 
proposed in Problem 2 of Chapter 5 is consistent against the alternatives for which 


Pr {| Xn Xna l> lR- 


3. Consider the randomized-block problem, (4.3), (4.4), or (4.5) in Chapter 3, when
the number of treatments is two. In Section 2.1 and Problem 6 of Chapter 5 the sign
test was shown to be most powerful for formulation (4.5) against one-sided alternatives.
Also the two-sided sign test has optimum properties (Problems 8 and 9 in Chapter 5).
Against alternatives for which treatment differences in each block all have the same
normal distribution, show that the sign test has an efficiency 2/π with respect to the usual
t test.

4. The problem of randomness was described in (3.3) and (3.4) in Chapter 3. Let
$X_i = \xi d_i + Z_i$ $(i = 1, \dots, n)$ where the $Z_1, \dots, Z_n$ are independent, each $Z_i$ has the
same absolutely continuous distribution, and $d_1, \dots, d_n$ are given constants. Consider
the case where the $d_i$'s are equally spaced and occur in order of magnitude, and hence
without loss of generality can be replaced by the integers 1, ..., n. For the parametric
class of distributions corresponding to each $Z_i$ having the same normal distribution,
compare the efficiency of the following tests. For convenience let

$r(t) = \dfrac{\left[\dfrac{\partial}{\partial \xi}E\{t(X)\}\Big|_{\xi = 0}\right]^2}{\operatorname{var}\{t(X)\}\big|_{\xi = 0}};$

then the efficiency of $t_1$ relative to $t_2$ is given by

$e(t_1, t_2) = \lim_{n \to \infty}\dfrac{r(t_1)}{r(t_2)}.$
(a) The difference-sign test (Moore and Wallis [5]). The difference-sign test is based
on D, the number of positive first differences in the sequence $x_1, \dots, x_n$. Using the
function c'(u) defined by (2.10), we can write

$D = \sum_{i=2}^{n} c'(x_i - x_{i-1}).$

The one- and two-sided tests are to reject for large values of D and

$\left|D - \dfrac{n-1}{2}\right|,$

respectively. Prove that
$r(D) \sim 3n/\pi.$
(b) The difference-sign correlation coefficient test. A test can be based on the
difference-sign correlation coefficient, t, between the x sequence and the d sequence. See (5.27) in
Chapter 6. Show that, for our special form of the d sequence, t can be written

$t = \dfrac{4Q}{n(n-1)} - 1,$

where
$Q = \sum_{i<j} c'(x_j - x_i).$
Prove that
$r(Q) \sim n^3/4\pi.$


(c) The rank correlation coefficient test. The rank correlation coefficient k' was defined
in Problem 11, Chapter 6. Show that the coefficient k' for the x and d sequences can be
written

$k' = \dfrac{12V}{n(n^2 - 1)} - 1,$

where
$V = \sum_{i<j}(j - i)\,c'(x_j - x_i).$
Prove that

$r(V) \sim n^3/4\pi.$


(d) The turning-point test. This test, proposed in [6], is based on the number of runs
up and down or equivalently on T, the number of "peaks" and "troughs" in the x
sequence:

$T = \sum_{i=3}^{n} T_i,$

where

$T_i = 1$ if $(x_i - x_{i-1})(x_{i-1} - x_{i-2}) < 0$,
$\phantom{T_i} = 0$ otherwise.

Prove that

$\dfrac{\partial}{\partial \xi}E\{T\}\Big|_{\xi = 0} = 0.$


(e) The t test. The usual t test is based on the statistic

$b = \dfrac{\sum (x_i - \bar{x})(d_i - \bar{d})}{\sum (d_i - \bar{d})^2}.$

Prove that
$r(b) \sim n^3/12.$
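
The asymptotic forms in this problem can be sanity-checked numerically before attempting the proofs. The sketch below is not a solution: it takes unit-variance normal errors and $d_i = i$, uses the fact that $\Pr\{X_j > X_i\}$ then has derivative $(j - i)/(2\pi^{1/2})$ at $\xi = 0$, and assumes the standard null variances $(n+1)/12$ for D and $n(n-1)(2n+5)/72$ for Q. The function names are ours.

```python
import math

def r_difference_sign(n):
    """r(D): squared slope of E{D} at xi = 0 over the null variance of D."""
    slope = (n - 1) / (2 * math.sqrt(math.pi))
    variance = (n + 1) / 12
    return slope ** 2 / variance

def r_kendall_q(n):
    """r(Q): the sum of (j - i) over i < j equals n(n^2 - 1)/6."""
    slope = n * (n * n - 1) / 6 / (2 * math.sqrt(math.pi))
    variance = n * (n - 1) * (2 * n + 5) / 72
    return slope ** 2 / variance

def r_slope_b(n):
    """r(b): E{b} = xi and var{b} = 12 / (n (n^2 - 1)) when d_i = i."""
    return n * (n * n - 1) / 12

n = 10_000
ratio_d = r_difference_sign(n) / (3 * n / math.pi)
ratio_q = r_kendall_q(n) / (n ** 3 / (4 * math.pi))
ratio_b = r_slope_b(n) / (n ** 3 / 12)
efficiency_q_vs_b = r_kendall_q(n) / r_slope_b(n)
print(ratio_d, ratio_q, ratio_b, efficiency_q_vs_b)
```

For large n the three ratios are close to one, and the ratio of r(Q) to r(b) is close to 3/π, while r(D) is of smaller order in n than r(b).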

5. The two-sample problem was described in Section 3.1 of Chapter 3. Let $X_1, \dots,
X_{n_1}$ be independent and each have an absolutely continuous distribution function F(x)
on the real line. Also let $X_{n_1+1}, \dots, X_{n_1+n_2}$ be independent and each $X_{n_1+j}$ have an
absolutely continuous distribution function G(x) on the real line. For the parametric
class of distributions corresponding to normal distributions with the same variance but
different means, find the relative efficiency of the following tests. Use the function r(t)
defined in Problem 4, and for convenience and no loss of generality consider the normal
distributions with variance 1, first sample mean 0, and second sample mean ξ.

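(a) The t test. The t test is based on the statistic

$t = \dfrac{\bar{x}_2 - \bar{x}_1}{\left[\dfrac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_{n_1+j} - \bar{x}_2)^2}{n_1 + n_2 - 2}\right]^{1/2}},$

where
$\bar{x}_1 = n_1^{-1}\sum_{i=1}^{n_1} x_i, \qquad \bar{x}_2 = n_2^{-1}\sum_{j=1}^{n_2} x_{n_1+j}.$

Show that

$r(t) = \dfrac{1}{1/n_1 + 1/n_2} = \dfrac{n_1 n_2}{n_1 + n_2}.$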


(b) The Mann-Whitney test. The Mann-Whitney test was defined in Example 1.1 in
Chapter 5 and can be based on the statistic V*,

$V^* = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} c'(x_{n_1+j} - x_i),$

where c'(u) was defined in (2.10). Prove that

$r(V^*) = \left(\dfrac{n_1 n_2}{2\pi^{1/2}}\right)^2 \dfrac{12}{n_1 n_2 (n_1 + n_2 + 1)}.$


(c) The median test. The median test is based on the number, u, of first-sample
values smaller than the median, z, of the combined sample. Make the inessential
restriction that the combined sample size is odd, say equal to 2r + 1. By using the
hypergeometric distribution, show that the joint probability density for u, z is given by


É ý mu dF. 
hlu, 2) = m i f i i ” Jroa — F(2))"*71G(z)-"(1 — G(2))" zo 
Ma (te = 1) ua G- — Gararen- CEO | 
+a (™)("~ roa -roroa -co = 


Show that (u, z) has a limiting bivariate normal distribution. For this let
u =n, F(c) + lv, 
z=c+ wal, 


where c satisfies

$n_1 F(c) + n_2 G(c) = \dfrac{n_1 + n_2}{2}.$


Use Stirling’s formula, and work with the logarithm of the density element. The 
quadratic form of the limiting distribution is 


n; 


ot [ 1 ; 1 | 
FO — F()) © nz GOU — G(e)) 


FAQ) g© 
a a — FO) Ged — Tl 
fo) i §*(c) 


oa l — F(o) GOU — Gle)? 


where f(x) = dF(x)/dx, g(x) = dG(x)/dx. Show that under the hypothesis the
large-sample variance of v is given by

$\dfrac{n_1 n_2}{4(n_1 + n_2)}.$

Hence show that

$r(v) = \dfrac{4(n_1 + n_2)}{n_1 n_2}\left[\dfrac{n_1 n_2}{(2\pi)^{1/2}(n_1 + n_2)}\right]^2.$

6. For the theory in Example 4.2 find the covariance matrix of Z. From this, by
using Theorem 2.8 in Chapter 6, show that $(s^*_n(GX), s^*_n(G'X))$ has the limiting distribution
of two independent $\chi^2$ random variables with c − 1 degrees of freedom.

7. The problem of randomness with regression alternative was defined in Section 3.3
of Chapter 3. In Section 5 of this chapter a rank test, the $c_1$ test of Terry, was proved
asymptotically as efficient as the standard t test. Prove that the following randomized
rank test also has limiting efficiency one. Proceed as in Terry's test, but use the
randomized statistic

$c^*_1(r) = \dfrac{\sum (c_i - \bar{c})\,Z_{(r_i)}}{\left[\sum (c_i - \bar{c})^2\right]^{1/2}},$

where $(Z_{(1)}, \dots, Z_{(n)})$ is the order-statistic random variable for an independent sample
of n from the standardized normal distribution.


REFERENCES AND BIBLIOGRAPHY 


1. E. L. Lehmann, "Consistency and unbiasedness of certain nonparametric tests,"
Ann. Math. Stat., Vol. 22 (1951), p. 165.

2. A. Stuart, "Asymptotic relative efficiencies of distribution-free tests of randomness
against normal alternatives," J. Am. Stat. Assoc., Vol. 49 (1954), p. 147.

3. W. Hoeffding, "The large sample power of tests based on permutations of
observations," Ann. Math. Stat., Vol. 23 (1952), p. 169.

4. W. J. Dixon, "Power functions of the sign test and power efficiency for normal
alternatives," Ann. Math. Stat., Vol. 24 (1953), p. 467.

5. G. H. Moore and W. A. Wallis, "Time series significance tests based on signs of
differences," J. Am. Stat. Assoc., Vol. 38 (1943), p. 153.

6. W. A. Wallis and G. H. Moore, "A significance test for time series analysis,"
J. Am. Stat. Assoc., Vol. 36 (1941), p. 401.

7. E. J. G. Pitman, "Significance tests which may be applied to samples from any
population. III. The analysis of variance test," Biometrika, Vol. 29 (1938), p. 322.

8. W. Hoeffding, "The large sample power of Fisher-Yates rank tests," unpublished.

9. M. E. Terry, "Some rank order tests which are most powerful against specific
parametric alternatives," Ann. Math. Stat., Vol. 23 (1952), p. 346.

10. W. Hoeffding, "On the distribution of the expected values of the order statistics,"
Ann. Math. Stat., Vol. 24 (1953), p. 93.


Index 


Absolute continuity, 12 

Additive partition function, 255 

Admissible decision function, 39 

Almost invariant functions, 96 

Auxiliary function of a confidence region, 
113 


Basis for σ-algebra, 30

Blocks, 152 

Borel sets, 3 

Boundedly complete statistic, 24 


c_1 test, 195, 203, 248, 250, 289, 296
Central limit theorem, Bernstein, 214 
Liapounoff, 213 
Lindeberg and Lévy, 213 
Characteristic function, 10 
of a confidence region, 111
χ² test of fit, 126
Complement, 2 
Complete class, of decision functions, 40 
of measures, 23 
Completeness, 23 
total, 173 
Complete statistics, 23 
for combined experiments, 26, 27 
order, 28 
Condensation, 3 
Conditional expectation, 15, 16 
Conditional probability, 12, 15 
by steps, 148 
Confidence region, 109, 110 
Consistent estimators, 143 
Consistent sequences of tests, 108, 267 
Convergence in probability, 208 




Convergence of distributions, 207 
by characteristic functions, 209 
by moments, 208, 209 

Convex function, 50 
strictly, 50 

Correlation coefficient, 179 
difference sign, 232, 234, 261, 293 
grade, 259 
rank, 234, 247, 258, 260, 294 
serial, 257 

Coverage, 150 

Critical region, 70 

Cumulants, 139, 144 


Decision, 36 

Decision function, 37 

Decision space, 36 

Degree of a parameter, 136 

Difference-sign test, 293 

Disjoint, 5 

Distance for distribution functions, 127 

Distribution, uniform within intervals, 
28 

Distribution-free tolerance region, 116, 
117 

Distribution function, 7 

Dominated class of measures, 19 


Efficiency, criterion of, 273 
of test sequences, 270 
Ellipsoid of concentration, 51 
Envelope power function, 103 
Estimable parameter, 61, 136 
Estimators, 47 
median unbiased, 49 


minimum variance, 62, 64
to improve, 57 
unbiased, 48 
use of complete statistics for, 61 
use of sufficient statistics for, 57 
Expectation, 11 


Fit, problem of, 126 
Fundamental lemma of hypothesis test- 
ing, 72 


General linear hypothesis, 133 
Gini’s mean difference, 145, 230 


Halmos and Savage, 20 
Homogeneous set of measures, 21 
Hunt and Stein lemma, 106 
Hypothesis, 69 

composite, 73 

simple, 72 


Independence, problem of, 129, 178, 184, 
202, 204, 205, 234, 247 

Information, 17 

Integral, 10 

Intersection, 2 

Invariant estimator, 67 

Invariant loss function, 67 

Invariant partition, 96 

Invariant test functions, 95, 106, 107 

Invariant transformations, 66 


k statistics, 144 

Kernel, 136 

Khintchine’s theorem, 212 
Kolmogorov’s theorem, 213 
Koopman, 21 


Least favorable measure, 80 

Least favorable parameter value, 79 

Lebesgue integral, 9 

Lebesgue measurable sets, 6 

Lehmann-Scheffé theorem, 61 

Likelihood ratio, 196, 204, 205 

Linear hypothesis, 122 

Location, problem of, 128, 167, 171, 179, 
202, 274, 281, 293 

Location and symmetry, problem of, 
129, 202, 204 

Loss function, 37 


m-dependence, 215 
Mann-Whitney (Wilcoxon) test, 162,
193, 200, 203, 235, 268, 295

Marginal probability measure, 8 
Maximal invariant function, 96 
Maximal invariant partition, 96, 98 
Measurable function, 4 
Measurable sets, 2 
Measurable space, 2 
Measure, 5 

bounded, 6 

extending, 6 

unbounded, 6 
Median test, 295 
Minimax decision function, 39 
Minimax risk test functions, 107 
Moments, 139 
Most stringent tests, 104 


Neyman criterion, 20 

Neyman structure, 88 
Noether condition, 236 
Noncentral F distribution, 123 
Nonparametric statistics, 125 
Nuisance parameters, 22 


Operating characteristic, 41, 70
Order statistic, 27, 33, 139, 187 
Outcome, 1 


p-lim, 208 
Parameter, 17, 46 
Parameter space, 17 
Pitman test, 178, 201, 249, 283, 284, 289 
Power function, 44, 71 
Power of a confidence region, 112, 113 
Probability, 5 
Probability density function, 19 
Probability measure, 6 
induced, 8 
Product sets, 3 
Product space, 3 
probability measure on, 7, 8


Radon-Nikodym theorem, 13 

Random experiments, 5

Random variable, 9 

Randomized block problem, 132, 170, 
202, 203, 262, 263, 283, 293 

Randomized confidence region, 110 


Randomness, problem of, 129, 181, 260, 293
with regression alternative, 130, 203, 286, 289, 296

Rank correlation test, 204 

Rao-Blackwell theorem, 57 

Relative efficiency of tests, 108 

Ring, 30 

Risk, 38 

Risk function, 45 


Sample space, 1 
Serial statistic, 221 
Several-sample problem, 130 
σ-algebra, 2
Sign test, 169, 200, 202, 276, 293 
Statistic, 4 
Statistical independence, 8 
Stochastically larger random variables, 
160 
Sufficient statistic, 17, 20 
for combined experiments, 21 
generalized, 170 
Symmetric kernel, 136 


t test, 180, 294 

Test function, 70 
exact size α, 71
most powerful, 72 


similar size α, 87
size α, 71
unbiased size α, 93
Tests, for composite hypotheses, 79 
for simple hypotheses, 72 
Tolerance region, β-expectation, 118
distribution-free, 147 
for a proportion p, 116, 156 
Turning-point test, 294 
Two-sample problem, 129, 160, 162, 164, 
174, 188, 194, 197, 201, 202, 203, 
235, 249, 263, 264, 268 
discontinuities, 200 
k-variate, 202, 263 
Two-sample scale problem, 131, 163, 201, 293


U statistic, 137, 223 

Unbiased estimators, 136 
of zero, 64 

Unbiased tests, 93 

Union, 2 


Vector, 18 


Wald—Wolfowitz condition, 236 
Wilcoxon test, 204
Wolfowitz test, 200 




